In this study, we evaluated two commonly used RNA library protocols for FFPE samples, RNA exome capture and rRNA depletion, using seven paired FFPE-FFzn samples. Samples processed with the RNA exome capture protocol showed a higher percentage of gene-mapped reads, captured more canonical junctions, yielded a higher SNP concordance rate, and showed better concordance with TruSeq PolyA data. Next, we sought to identify pre-sequencing metrics that could predict sample pass/fail status as defined by post-sequencing bioinformatics metrics. All study samples, along with replicate samples, were processed using the RNA exome protocol. Three bioinformatics metrics were used to identify qc-failed samples: sample-wise median correlation (median_cor), number of gene-mapped reads (gene_reads), and number of detectable genes with counts per million (CPM) larger than 2 (gene_cpm2). Finally, a decision tree-based model was built to examine the relationship between pre-sequencing lab metrics and qc status as defined by post-sequencing bioinformatics metrics. Based on the model, we recommend a minimum of 25 ng/ul for RNA concentration and 1.8 ng/ul for pre-capture library concentration for FFPE samples to generate good-quality RNA-seq data for bioinformatics analysis. We also demonstrated that FFPE replicates show reproducibility similar to that of FFzn replicates across sequencing batches. However, genes with short length or high GC content are more likely to be influenced by the FFPE procedure.
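As an illustration of two quantities summarized above, the following minimal Python sketch computes gene_cpm2 from a vector of per-gene read counts and checks a sample against the recommended input minima. It is not the code used in this study. The function names, the plain-list input format, and the choice to require both minima at once are assumptions made for the example; only the CPM > 2 definition and the 25 ng/ul and 1.8 ng/ul values come from the text.

```python
# Illustrative sketch only; names and input format are hypothetical.

MIN_RNA_CONC_NG_UL = 25.0   # recommended minimum RNA concentration (from the text)
MIN_LIB_CONC_NG_UL = 1.8    # recommended minimum pre-capture library concentration


def counts_per_million(counts):
    """Scale raw per-gene read counts to counts per million (CPM)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no gene-mapped reads")
    return [c * 1e6 / total for c in counts]


def gene_cpm2(counts, min_cpm=2.0):
    """Number of detectable genes, i.e. genes with CPM larger than min_cpm."""
    return sum(1 for value in counts_per_million(counts) if value > min_cpm)


def meets_input_recommendation(rna_conc, lib_conc):
    """True if both concentrations (ng/ul) reach the recommended minima.

    Combining the two minima with a logical AND is an assumption of this
    sketch, not a statement about the structure of the fitted decision tree.
    """
    return rna_conc >= MIN_RNA_CONC_NG_UL and lib_conc >= MIN_LIB_CONC_NG_UL


if __name__ == "__main__":
    toy_counts = [0, 1, 3, 999996]  # toy sample with one million mapped reads
    print(gene_cpm2(toy_counts))
    print(meets_input_recommendation(rna_conc=30.0, lib_conc=2.0))
```

In practice, a check such as meets_input_recommendation would be applied before sequencing, whereas gene_cpm2 can only be computed after sequencing.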
Clinical biospecimens are typically stored as FFPE blocks, which represent an invaluable source of material for biomedical research. FFPE blocks enable prolonged storage of clinical samples, preserving both tissue morphology and nucleic acid information. However, FFPE processing and tissue storage have been shown to affect RNA quality, thus limiting gene expression quantification by technologies such as RNA sequencing. Our study provides a guideline for future research that uses FFPE samples for RNA-seq. Sequencing samples with RNA and library input higher than our recommended values will not only help yield a better success rate for RNA sequencing, but also help avoid unnecessary sequencing costs.
Our study has several limitations. Firstly, we benchmarked two commonly used library preparation protocols for FFPE samples using bioinformatics metrics, including SNP confirmation rate. SNP confirmation rate (precision) was calculated as the percentage of SNPs identified by RNA-seq that were true SNPs (called from WES data) for the same sample. This calculation does not account for RNA-specific mutations introduced by events such as RNA editing. However, RNA editing events are considered very rare, and the expected SNP confirmation rate should be very close to our calculation in Additional file 1 [12]. Secondly, when performing bioinformatics QC using replicate samples, because few replicate samples had median_cor in the 0.7 to 0.8 range, we arbitrarily selected a cutoff value (0.75) near the inflection point of the loess-fitted curve between median_cor and FPR. This choice may affect our definition of qc pass/fail as determined by those bioinformatics metrics. To give users more flexibility in selecting cutoffs for those bioinformatics metrics, we have provided documentation that enables the end user to define customized cutoffs based on their preferred stringency: https://github.com/Liuy12/FFPEinput. Thirdly, the concentration of RNA in the original samples is highly dependent on the amount of input tissue, the original handling and storage of the sample, the extraction method used, and, perhaps most importantly, the elution volume used following extraction and purification. It is difficult to compare these amounts across samples or studies unless all of these factors are controlled. The library concentrations are more comparable because they are based on a consistent total RNA amount going into the library prep. In addition to RNA and library input metrics, we investigated other pre-sequencing lab metrics, including DV50, DV100, and DV200 values.
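The SNP confirmation rate defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in this study: each SNP is represented as a hypothetical (chromosome, position, reference allele, alternate allele) tuple, the WES calls are treated as the truth set, and the function name is invented for the example.

```python
# Illustrative sketch only; the tuple representation and function name are hypothetical.

def snp_confirmation_rate(rna_snps, wes_snps):
    """Percentage of RNA-seq SNP calls that are also present in the WES calls.

    This is the precision of the RNA-seq calls when WES is taken as truth.
    RNA-specific variants (for example RNA editing sites) count as
    unconfirmed, which is the limitation noted in the text.
    """
    rna_snps = set(rna_snps)
    wes_snps = set(wes_snps)
    if not rna_snps:
        raise ValueError("no SNPs were called from RNA-seq for this sample")
    confirmed = rna_snps & wes_snps
    return 100.0 * len(confirmed) / len(rna_snps)


if __name__ == "__main__":
    rna_calls = {
        ("chr1", 1000, "A", "G"),
        ("chr1", 2000, "C", "T"),
        ("chr2", 3000, "G", "A"),
        ("chr3", 4000, "T", "C"),
    }
    wes_calls = {
        ("chr1", 1000, "A", "G"),
        ("chr1", 2000, "C", "T"),
        ("chr5", 9000, "A", "T"),
    }
    print(snp_confirmation_rate(rna_calls, wes_calls))
```

SNPs present only in the WES calls do not affect this value, because precision is computed relative to the RNA-seq calls.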
The recommended quantities of starting FFPE material according to the vendor correspond to a range of DV200 values, with the lowest recommended quality at a DV200 of 30–50%. Recommendations for input based on DV50 or DV100 values have not been evaluated by the vendor. Due to the RNA input limit, we were only able to quantify DV metrics for around 70% of all study samples. Based on those limited data, we observed that DV50 is highly correlated with DV100. Both DV50 and DV100 are moderately correlated with DV200, a conventional metric for measuring RNA quality. The DV50 value was identified as the top predictive feature for sample failure using a recursive feature elimination algorithm. However, including DV50 in the decision tree model did not improve performance compared with using RNA/library input metrics alone. We suspect that this could be due to the smaller number of samples with available DV values.
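For readers unfamiliar with the DV metrics, the sketch below illustrates the idea in Python, assuming the usual definition of DVn as the percentage of RNA fragments longer than n nucleotides. In practice these values are derived by the instrument from an electropherogram trace. The list of fragment lengths and the function name are simplifications introduced only for this example.

```python
# Illustrative sketch only; real DV values come from electropherogram traces,
# not from an explicit list of fragment lengths.

def dv_metric(fragment_lengths, threshold_nt):
    """Percentage of fragments longer than threshold_nt nucleotides."""
    if not fragment_lengths:
        raise ValueError("no fragments provided")
    longer = sum(1 for length in fragment_lengths if length > threshold_nt)
    return 100.0 * longer / len(fragment_lengths)


if __name__ == "__main__":
    toy_lengths = [30, 60, 120, 250, 400]  # toy fragment sizes in nucleotides
    for threshold in (50, 100, 200):
        print(f"DV{threshold}", dv_metric(toy_lengths, threshold))
```

Under this definition, DV50 is always at least as large as DV100, which in turn is at least as large as DV200 for the same sample, because each metric counts a subset of the fragments counted by the previous one.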