HIT-scISOseq design
Droplet-based single-cell RNA sequencing, as performed with the 10x Genomics Chromium system, is usually used as a scalable solution for full-length cDNA library construction in ScISOr-Seq. The 10x Genomics system uses microfluidic partitioning to capture mRNA in single cells and then prepares barcoded, full-length cDNA libraries for the PacBio sequencing platform (Figure S1). The 10x system combines template-switching oligonucleotides (TSOs) and reverse transcription reactions to prepare small volume cDNA libraries, which can lead to 40–50% of the library being composed of barcode-less TSO artifacts (Fig. 1b, Table 1). Sequencing these artifact reads reduces the throughput of useful CCS reads by 50%. To remove artifacts, a biotinylated PCR primer that hybridizes to the desired cDNAs is constructed in the HIT-scISOseq system; the cDNAs can then be biotinylated during PCR amplification and captured using streptavidin beads (Figs. 1a). The capture step can significantly reduce the percentage of TSO artifacts from 50%, as observed when using the standard ScISOr-Seq method to 8% when using HIT-scISOseq (Fig. 1b).
Table 1
Performance of ScISOr-Seq and HIT-scISOseq according to throughput and accuracy in two replicate samples of cynomolgus monkey corneal limbus cells.
| | ScISOr-Seq | | Linked-scISOseq | | HIT-scISOseq |
| Sample | s1 | s2 | s1 | s1 | s2 |
Raw Data | Polymerase reads count (M) | 4.95 | 4.30 | | 5.02 | | 4.74 | 5.69 |
Yield of polymerase reads (GB) | 499.77 | 415.52 | | 365.12 | | 383.74 | 438.64 |
Avg. polymerase reads length (Kb) | 101.06 | 96.66 | | 72.78 | | 80.88 | 77.06 |
Yield of subreads (GB) | 487.53 | 405.99 | | 361.45 | | 379.87 | 434.44 |
Avg. subreads length (Kb) | 1.55 | 1.68 | | 3.64 | | 3.46 | 3.61 |
CCS Reads | CCS reads count (M) | 4.02 | 3.38 | | 3.70 | | 3.43 | 4.23 |
Yield of CCS reads (GB) | 8.04 | 7.12 | | 16.56 | | 16.75 | 21.62 |
Avg. CCS reads length (Kb) | 2.00 | 2.11 | | 4.48 | | 4.89 | 5.11 |
Avg. CCS reads passes | 70 | 64 | | 21 | | 23 | 20 |
Avg. CCS reads QV | 0.97 | 0.97 | | 0.95 | | 0.95 | 0.95 |
FLNC Detection | Linked cDNA count (M) | 3.44 | 2.90 | | 11.57 | | 11.64 | 14.84 |
FLNC count (M) | 1.60 | 1.29 | | 5.25 | | 10.47 | 13.23 |
NFL count (M) | 0.07 | 0.06 | | 0.14 | | 0.28 | 0.39 |
Artifact RNA count (M) | 1.76 | 1.55 | | 6.18 | | 0.88 | 1.22 |
FLNC percentage (%) | 46.59 | 44.53 | | 45.34 | | 89.99 | 89.15 |
NFL percentage (%) | 2.10 | 2.03 | | 1.24 | | 2.43 | 2.66 |
Artifact RNA percentage (%) | 51.31 | 53.44 | | 53.42 | | 7.58 | 8.20 |
FLNC Mapping | Mapped FLNC Count (M) | 1.59 | 1.29 | | 5.20 | | 10.34 | 13.05 |
Mapped FLNC percentage (%) | 99.44 | 99.50 | | 99.05 | | 98.75 | 98.65 |
Avg. FLNC mapping coverage (%) | 99.13 | 99.14 | | 98.88 | | 98.90 | 98.83 |
Avg. FLNC mapping identity (%) | 98.45 | 98.39 | | 97.60 | | 97.74 | 97.59 |
Avg. collapsed FLNC length (Kb) | 2.37 | 2.47 | | 2.18 | | 2.22 | 2.24 |
For raw data, the rows show (from top to bottom): (i) total polymerase read count (million) for each sample; (ii) sum of all polymerase read bases (gigabase) for each sample; (iii) average polymerase read length (kilobase) of each sample; (iv) sum of all subread bases (gigabase) in each sample; and (v) average subread length (kilobase) of each sample. |
For CCS reads, the rows show (from top to bottom): (i) total CCS read count (million) for each sample; (ii) sum of all CCS read bases (gigabase) in each sample; (iii) average CCS read length (kilobase) of each sample; (iv) average CCS read passes in each sample; (v) average CCS read QV (Phred 33) in each sample. |
For FLNC detection, the rows show (from top to bottom): (i) total linked cDNA (defined as linked cDNA in each CCS read) count (million) in each sample; (ii) total full-length non-concatemer (FLNC) read count (million) in each sample; (iii) total non-full length (NFL) read count (million) in each sample; (iv) total artifact cDNA count (million) in each sample; (v) percentage of FLNC in linked cDNAs of each sample; (vi) percentage of NFL in linked cDNAs of each sample; and (vii) percentage of artifact cDNAs in linked cDNAs of each sample. |
For FLNC mapping, the rows show (from top to bottom): (i) total mapped FLNC count (million) of each sample; (ii) percentage of mapped FLNC in total FLNC of each sample; (iii) average mapping coverage of mapping FLNC in total FLNC of each sample; (iv) average mapping identity of mapping FLNC in total FLNC of each sample; and (v) average collapsed FLNC reads (defined as the reads after mapping quality filtering and collapsing of redundancy) length (kilobase) in each sample. |
Another significant barrier that limits the throughput of CCS read yields is the short insert size of the SMRTbell library in Sequel II systems. The current CCS system allows 10–15 kb SMRTbell libraries to be employed for HiFi long-read sequencing, in which a single DNA polymerase enzyme is affixed to the bottom of a zero-mode waveguide (ZMW) nanoscale well with a single molecule of DNA as a template. For genome sequencing, PacBio recommends a library insert size of 9–15 kb for the Sequel II system. However, the short length of cDNA transcripts (1.5 kb on average) limits the library insert size, which is incompatible with the long-read capacity of ZMW nanoscale wells. Previous studies have used Gibson Assembly or Golden Gate Assembly to ligate target short or mid-sized DNA fragments (ConcatSeq: ~200bp, DeCatCounter: ~870bp) into long SMRTbell libraries for PacBio sequencing [21, 22]; however, these methods show a low throughput, and there are no corresponding reports on full-length cDNA concatenation. Currently, methods for ligating and sequencing full-length isoforms of uneven length at the whole-transcriptome level are still lacking.
Currently, the throughput of ScISOr-Seq is only 20%-30% of that of gDNA sequencing (Table 1). To match the capacity of the ZMW nanoscale well in the PacBio Sequel II system, we hypothesized that a long-insert SMRTbell template could be created by linking multiple cDNA inserts together for downstream Sequel II CCS sequencing. For HIT-scISOseq, a palindromic sequence included at both ends of the primer was designed for the second round of PCR, and the USER enzyme was employed to generate sticky ends (Fig. 1a). Multiple cDNAs were joined using DNA ligase in a head-to-tail fashion. After HIT-scISOseq was preprinted on the bioRxiv [23], the application of the USER enzyme was modified for MAS-ISO-seq to generate sequential array structure sticky ends to ligate cDNAs into ~ 15 kb sequences, but this required the cDNA of each sample to be divided into 15 tubes for PCR amplification, which increased the number of experimental steps and the complexity [24].
HIT-scISOseq can be used in a widely accessible droplet-based 10x Genomics Chromium system and provide essential isoform information. HIT-scISOseq can achieve a remarkable increase in the number of mapped full-length reads, up to eight times the number produced by the standard ScISOr-seq method (Fig. 1c, g). The present study demonstrates that this approach is reproducible and readily adaptable to high-throughput single-cell isoform sequencing applications.
Performance Of Hit-scisoseq Sequencing Runs
We sought to compare sequencing read outputs among different library preparation methods using the same PacBio Sequel II instrument and Sequel II SMRT Cell 8M; the evaluated methods included: ScISOr-Seq (Figure S1), Linked-scISOseq (Figure S2), and HIT-scISOseq (Fig. 1a & Figure S3). Among these methods, ScISOr-Seq is a standard library preparation method (not involving capture and concatenation procedures). Linked-scISOseq only includes a full-length cDNA concatenation procedure and no TSO artifact removal step. Comparisons of the three methods allowed the relative performance of each procedure to be assessed. The performance of the three methods was directly compared using the same limbal epithelium RNA samples, whose transcriptional profiles had previously been well-characterized. For each sample, two replicates (s1 and s2) were performed. A total of five SMRT Cells were sequenced on the PacBio Sequel II system, as s1 samples were only used for Linked-scISOseq. The libraries were sequenced following the Iso-Seq sample preparation protocol, with the recommended loading concentrations (Supplementary Table 1).
The computational analysis of concatenated full-length cDNAs requires special attention to be given to the physical proximity of multiple cDNAs and the random 5’-to-3’ direction. Therefore, an isoform data analysis pipeline (scISA-Tools, see method sections) was developed to identify and quantify poly(A) tails, cell barcodes (cellBC), unique molecular identifiers (UMI), and the assignment of reads to cells and RNA molecules. Based on PacBio’s recommended Iso-Seq data processing procedure, the mapped cDNAs were further classified as full-length non-chimeric (FLNC), non-full-length (NFL), or artifact reads, based on the presence of a poly(A) tail signal and the 5’ and 3’ cDNA primers. Reads refer to those with neither the 3’ primer nor the poly(A) tail were referred to as artifact reads.
The performance assessment could be roughly divided into four process elements: raw polymerase reads, CCS reads, FLNC reads, and mapped FLNC reads (Table 1). All three methods yielded similar amounts of raw polymerase reads (ranging from 4.30 to 5.69 M), while the percentage of productive ZMWs (P1 percentage metric) ranged from 53.75–71.13%. The similarity of the polymerase read yields among the three methods suggested that the SMRTbell cDNA templates produced by all three methods were of high quality. Furthermore, the average polymerase read length under all three methods was > 70 kb, suggesting good quality in the instrument runs. PacBio suggests that a minimum of three full passes of a long insert (typically peaking at 4.9 kb to 5.2 kb in the current study, Figure S5) are needed to produce reads with an accuracy greater than 0.9 (default requirement for CCS sequencing analysis). The average number of full passes was > 20 for both Linked-scISOseq and HIT-scISOseq, indicating high consensus accuracy (> Q20) of both methods (Fig. 1d). This high consensus accuracy allowed us to demultiplex HIT-scISOseq reads based on 10x Genomics cellular barcodes, and it was found that > 93% of the HIT-scISOseq FLNC reads could be successfully assigned to individual cells with a CCS QV > = 0.95 (Supplementary Tables 3). To the best of the authors’ current knowledge, the method outlined above achieved the highest cell barcode demultiplexing rate. Notably, the polymerase read lengths obtained via Linked-scISOseq and HIT-scISOseq were only 70% of those found under standard ScISOr-Seq (Table 1, Figure S6). This occurred because damage to the linked inserts in the SMRTbell libraries may hamper the polymerase reaction (Table 1). Therefore, the polymerase read yields generated from Linked-scISOseq and HIT-scISOseq are relatively lower than those from ScISOr-Seq.
All three methods also generated a similar number of CCS reads, ranging from 3.38 M to 4.23 M. The number of CCS reads was positively associated with the polymerase read count (Table 1). Both Linked-scISOseq and HIT-scISOseq generated longer average CCS read lengths (4.48 kb, Linked-scISOseq; 4.89 kb for the s1 sample and 5.11 kb for the s2 sample, HIT-scISOseq), which were more than double those generated by ScISOr-seq (Table 1, Figure S6). The average CCS read lengths were similar to the average concatenated full-length transcript lengths during library construction (Table 1, Figure S6). All three methods produced similar CCS read QV values (> 0.95), suggesting that high-quality CCS reads could be generated using long insert SMRTbell templates obtained via the ligation of multiple cDNA fragments.
The linked cDNA reads were demultiplexed to generate FLNC reads by using scISA-Tools (see Methods section). Both Linked-scISOseq and HIT-scISOseq generated a larger number of cDNAs than ScISOr-Seq. Notably, HIT-scISOseq produced a much lower number of artifact cDNA reads (7.58%, s1 sample; 8.20%, s2 sample) than Linked-scISOseq (53.42%) and ScISOr-Seq (51.31%, s1 sample; 53.44%, s2 sample). This result indicates that the capture procedure in HIT-scISOseq effectively removes the majority of artifact reads resulting from TSO-flanked fragments during library construction, ultimately increasing the final cDNA read yield. When artifact reads were excluded, the net FLNC read percentages (FLNC/(NFL + FLNC)) obtained were quite similar among the three methods.
HIT-scISOseq generated a greater number of mapped FLNC reads than ScISO-Seq and Linked-scISOseq did. After aligning the PacBio reads against the monkey reference genome, HIT-scISOseq produced the greatest number of mapped reads (10.34 M, s1 sample; 13.05 M, s2 sample), representing 6x and 10x more reads per SMRT Cell compared to the ScISOr-Seq results for s1 and s2, respectively, and up to 2x more reads compared to the Linked-scISOseq. The above-mentioned capture and concatenation procedures increased the mapped FLNC reads by factors of two and four, respectively, with a combined 8-fold increase in yield. The number of single-cell genes and UMI detection levels of HIT-scISOseq were markedly higher than those of ScISOr-Seq (Fig. 1g). Despite the observed differences in read yield, the three methods showed similar mappability of FLNC reads. More than 98% and 99% of the FLNC reads from HIT-scISOseq and standard ScISOr-Seq, respectively, were mappable (Table 1). These mappability results again confirmed the high quality of the FLNC reads. The average mapping coverage of FLNC reads was > 98%, and the average mapping identity values were > 97% for both Linked-scISOseq and HIT-scISOseq, which are comparable to the values generated by ScISOr-Seq (Table 1). These mapping metrics confirm the robustness of the scISA-Tools pipeline for precise read alignment.
Although the median FLNC length of HIT-scISOseq was shorter than that of ScISOr-Seq, HIT-scISOseq could still cover the range of FLNC length obtained via ScISOr-Seq (Fig. 1e); in addition, the average lengths of collapsed reads (transcripts) obtained from HIT-scISOseq were comparable to those from ScISOr-Seq (Table 1, Figure S6). Although, HIT-scISOseq favors shorter cDNAs, but this does not skew the gene expression profile compared to ScISOr-Seq (Figure S9). Additionally, two biological replicate data sets generated from via HIT-scISOseq were compared, which revealed that the results for the two replicates were very similar. The subtle differences in read yield metrics between biological replicates may be due to the differences in the percentage of productive ZMW loading and in sample quality. In addition, by extending the reaction time of the USER enzyme and T4 DNA ligase, we found that we were able to ligate cDNAs longer. Combined with the latest PacBio polymerase binding kit (which is suitable for libraries above 3kb), HIT-scISOseq was able to obtain FLNCs up to 30M (Supplementary Tables 2, Figure S7). These results indicate that HIT-scISOseq still has room for optimization and enhancement.
Hit-scisoseq Assigns Cell Barcodes With High Accuracy
The accurate demultiplexing of HIT-scISOseq concatenated reads into single-cell full-length isoforms is a central factor in determining the accuracy of cell barcodes. The use of palindromic end adapter sequences to ligate a variable number of cDNAs in HIT-scISOseq, makes the segmentation of concatenated reads difficult. By enumerating the possible forms of ligation between two cDNAs (Fig. 2a), we found that the correct segmentation of the FLNC depends on the combination of adapters for two cases (5p + 3p- and 3p + 5p-). Accordingly, although HIT-scISOseq lacks a sequential array structure similar to that of MAS-ISO-seq, scISA-Tools can still segment concatenated reads accurately.
To evaluate the accuracy and efficiency of concatenated read demultiplexing and cell-barcode assignment, we amplified the SIRV Set4 synthesis RNA isoforms with “AAGTCCTTCCAGTCTT + 12N” barcode labeled PCR primers, which was 1 edit distance from the most similar 10x whitelist barcode. After double-strand cDNA synthesis, we added 0.1 ng of barcoded SIRV cDNA to 99 ng of cDNA from a 10x Genomics human-mouse cell line mixture cDNA for use as a known cell for HIT-scISOseq library preparation. After demultiplexing HIT-scISOseq concatenated reads, we used mapped FLNC from SIRV and human-mouse mixture to accurately calculate their corresponding TP, FP, TN, and FN values (Fig. 2d) for barcode detection, which allowed us to calculate the accuracy and specificity of the barcode detection. As shown in Fig. 2b-c, scISA-Tools could achieve 99.997% and 99.998% barcode assignment accuracy and specificity, respectively (FLNC QV cut off > = 0.95, which means that 1 mismatch was allowed in a 16bp barcode). The mixed human-mouse data further confirmed that demultiplexing and barcode assignment were accurate (Fig. 2d-h).
Hit-scisoseq Gene Expression Clustering Of Corneal Limbus Single Cells Into Cell Types
Single-cell RNA sequencing has been widely used to quantify gene expression and to identify distinct cell types. To validate the ability of HIT-scISOseq to distinguish different cell types, HIT-scISOseq and Illumina short-read RNA sequencing (NGS) were compared using the same single-cell 10x Genomics limbal epithelium cDNA samples, which consisted of several well-defined cell types, and a strong concordance was identified between the two platforms. Gene expression based on HIT-scISOseq data were quantified using the scISA-tools pipeline. There was a strong correlation of the UMI counts by cellBC (Pearson's r > 0.990, p < 0.001, Figs. 3a & S10a) and the UMI counts by gene (Pearson's r > 0.950, p < 0.001, Figs. 3b & S10b) between the HIT-scISOseq and NGS platforms. There was also a high concordance in UMI counts by gene in the HIT-scISOseq data generated from the two biological replicates (Pearson's r = 0.998, p < 0.001, Fig. 3c). Moreover, UMAP projection of gene expression data from the two platforms showed consistent results in terms of cell-type classification (four cell clusters, Figs. 3d-e & S10c-d, Supplementary Table 6–7) with clear cell type boundaries, including conjunctival cells, limbal stem cells, central basal cells, and differentiated cells. The barcoding consistency of the top-ranked 2000 cells between NGS and HIT-scISoSeq was 99% (Figure S8). The gene expression values obtained for the same cell type showed a high correlation (Pearson's r > 0.95, Figs. 3g & S10f) between NGS and HIT-scISOseq, with the percentage of shared cell barcodes for the same cell type being > 99% (Figs. 3f & S10f, Supplementary Table 6). The high concordance of cellBC counts suggests that HIT-scISOseq can accurately identify the transcriptomes of cells isolated with the 10x Genomics system. It was then possible to create heatmaps of the top 15 marker genes for each cell cluster (Figs. 3h-i & S10g-h). The expression of marker genes also showed similar patterns between the two platforms. These results confirm that the gene expression data derived from HIT-scISOseq are comparable to those derived from the NGS gene platform.
Hit-scisoseq Captures Single-cell Isoform Expression In Corneal Limbus
To verify that HIT-scISOseq can accurately quantify isoform expression, we first used SIRV to assess that isoform detection is not confounded by isoform sequence similarity. We performed isoform identification confusion matrix calculations using HIT-scISOseq SIRV isoform data, which showed confusion rates as low as 0.1066% (Fig. 4b). We evaluated the isoform quantification results by comparing the observed values obtained via HIT-scISOseq with known ERCC isoform abundance data. The abundance measured by HIT-scISOseq was highly consistent with the underlying facts with a correlation coefficient of 0.97 (Fig. 4a).
Next, we hypothesized that it would be possible to identify and quantify single-cell isoforms using the HIT-scISOseq data set. After SQANTI3 quality control and artifact filtering of the corneal limbus data, we retained only four main types of isoforms according to SQANTI3 classification: FSM, ISM, NIC, and NNC. Finally, we retained 29392 and 31793 isoforms from the s1 and s2 samples, respectively (Supplementary Table 5). Figure 4c shows that at the single-cell level, FSM was the most abundant isoform type in both samples, and there were a considerable number of NNC isoforms, indicating that our data may be used to improve the reference annotation.
Based on isoform-level expression, we identified the same cell clustering pattern by gene level expression clustering analysis (Figs. 4d). In addition, isoform-level expression was strongly correlated between the two biological replicate samples (Figure S11). The top 15 marker isoforms for each cell cluster were further analyzed and some of these isoforms were found to be previously unidentified (Figs. 4e). This result suggests that HIT-scISOseq can reveal more complete isoforms in single cells. We further selected 2 marker isoforms associated with marker genes in each cell type for expression pattern verification. The dot plot and feature plot showed that these marker isoforms presented highly cell-type-specific expression (Fig. 4f-g), supporting the capability of HIT-scISOseq to capture single-cell isoform expression in the corneal limbus.
To validate the cell-type specific isoforms detected by HIT-scISOseq, especially based on sequencing-free methods (e.g., qPCR), we chose corneal basal cells (denoted as B) and conjunctival cells (denoted as Cj) as validation samples because they can be sampled from two separated regions in the ocular surface (Figure S12 a). We selected four cell-type specific isoforms in B and Cj for qPCR validation, respectively, and their corresponding genes were expressed in both clusters (Figure S12 b-c). The qPCR results showed expression patterns consistent with the HIT-scISOseq results (Figure S12 e).
Hit-scisoseq Revealed Cell-type-specific Isoform Expression Changes In The Corneal Limbus
Next, we mined isoform-driven expression changes between different cell types. To rule out the effect of gene expression, we first applied the ‘FindAllMarkers’ function in the Seurat R package to gene and isoform expression matrices. Then, cell-type-specific differentially expressed isoform were selected (Supplementary Table 9) under the following conditions: the avg_log2FC value of the up/down-regulated isoforms (p-value < 0.01) must be at least 2-fold higher or lower than the avg_log2FC (p-value < 0.01) of their associated genes. We finally obtained 158 isoforms with cell-type-specific differential expression (Fig. 5b, Supplementary Table 8–9), which represented 147 genes (Fig. 5a, Supplementary Table 10). These genes were enriched in pathways such as cell adhesion regulation, tissue morphogenesis, epithelial cell differentiation, epithelial cell proliferation, and skin development, etc. which are highly related to corneal epithelial cell proliferation and differentiation (Fig. 5c).
We further selected 4 isoforms showing cell-type-specific differentially expressed associated with 4 genes (ITM2B, DUSP1, B2M, and HOPX) whose expression patterns differed significantly between cell types (Fig. 5e). Figure 5e shows that the expression of these genes is driven by the major isoforms, and the expression pattern of the changed isoforms is inconsistent with the expression pattern of their corresponding genes and major isoforms (Fig. 5e, f-h). The exon structures of these expression-changed isoforms differ significantly from the reference annotation and the major isoform (Fig. 5d), possibly indicating that these isoforms play different functional roles than the major isoform. Previous studies have shown that these 4 genes are associated with neurogenerative diseases [25], tumor resistance [26], hypercatabolic hypoproteinemia [27], and cardiac development [28], but they have rarely been reported in the corneal limbus. This finding have not previously been verified independently by low-throughput isoform sequencing or NGS based scRNAseq.