HIT-scISOseq: High-throughput and High-accuracy Single-cell Full-length Isoform Sequencing

doi:10.21203/rs.3.rs-114035/v1

Article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 06 May, 2023

Read the published version in Nature Communications →

Version 1

Although long-read single-cell isoform sequencing (scISO-Seq) can reveal transcriptomic dynamics in individual cells that are invisible to NGS-based single-cell RNA analysis, scISO-Seq has been limited by low throughput, high error rates, and long running times. Here, we introduce HIT-scISOseq, the first method that concatenates multiple full-length cDNAs for PacBio circular consensus sequencing (CCS) to achieve high-throughput, high-accuracy single-cell isoform sequencing. HIT-scISOseq can yield >10 million high-accuracy full-length isoforms in a single PacBio Sequel II SMRT Cell 8M. We have developed scISA-Tools, which demultiplexes HIT-scISOseq concatenated reads into single-cell full-length isoforms with >99.99% accuracy and specificity. We have applied HIT-scISOseq to characterize the transcriptome of thousands of corneal limbus cells and reveal cell-type-specific isoform expression changes that were not previously identified by NGS-based scRNAseq. HIT-scISOseq is a high-throughput, high-accuracy, and technically accessible method that can be used by most laboratories to accelerate the burgeoning field of long-read single-cell transcriptomics.

Single-cell

full-length isoform sequencing

PacBio SMRTbell

cDNA concatenation

Single-cell RNA sequencing (scRNA-Seq) technologies can resolve expression heterogeneity across different cell types and states and have been widely used in fields involving complex biological and pathological processes, such as developmental biology, oncology, neuroscience, and immunology [1–5]. While these methods are accurate and cost-effective, current next generation sequencing (NGS)-based scRNA-Seq [6] technologies can only be used for the quantitative analysis of gene expression and cannot reveal complex transcript isoforms [7]. Recently, multiple microfluidics-based [8–13] and well-based [14, 15] approaches to single-cell isoform RNA-Seq (ScISOr-Seq [16]) have been developed by combining the 10x Genomics platform with single-molecule long-read sequencing, typically using PacBio or Oxford Nanopore sequencing technologies. Long-read sequencing platforms have also been used to investigate poly(A)-tail characteristics [17, 18]. Long-read sequencing techniques enable single-cell isoform RNA-Seq to identify full-length mRNA isoforms and thereby characterize alternative splicing, gene fusion, and sequence diversity.

However, existing single-cell isoform RNA-Seq methods suffer from low read throughput for two principal reasons. First, existing methods that rely on the 10x Genomics single-cell preparation pipeline introduce a high proportion (approximately 50%) of undesirable cell-barcode-free reads (template-switching oligonucleotide (TSO) artifacts) during library construction [12]. Sequencing these artifacts wastes approximately 50% of sequencing resources [12, 19]. Second, the long-read sequencing technologies themselves also limit throughput. Although the Nanopore PromethION platform can generate > 100 million raw reads, its overall cell-barcode demultiplexing efficiency is much lower than that of PacBio technology [13] owing to the relatively high error rate of the raw reads. Additionally, ScISOr-Seq has shown that the PacBio circular consensus sequencing (CCS) platform presents a low false-positive rate for cell barcodes (< 0.01%) and high isoform mapping accuracy [16], but the throughput of the PacBio CCS system remains low because the conventional CCS SMRTbell library preparation procedure can only construct a short-insert (1.5 kb on average) cDNA library for a typical human transcript. These short inserts are mismatched with PacBio’s HiFi long-read sequencing capacity (9–15 kb per CCS) [20], thus impairing the sequencing output of the Sequel II system. A previous ScISOr-Seq study showed that a total of 11 Sequel SMRT Cells 1M were needed to generate 5.2 million reads to characterize 6,000 single cells [16]. The low yields of existing ScISOr-Seq methods lead to high costs and prevent their use in large-scale applications.

To overcome the throughput limitations of ScISOr-Seq, we developed a high-throughput single-cell isoform sequencing method (HIT-scISOseq) (Fig. 1a, Figure S3) as a strategy for high-throughput, high-accuracy single-cell RNA isoform sequencing. We proposed two steps to overcome the limitations of current ScISOr-Seq technology: a TSO artifact removal step, and a cDNA concatenation step. In the TSO artifact removal step, HIT-scISOseq used a PCR-based biotin-assisted capture procedure to remove TSO artifacts and enrich single-cell, full-length cDNA sequences. In the cDNA concatenation step, we proposed a novel library preparation procedure aimed at concatenating multiple full-length cDNAs into one long SMRTbell insert to match the PacBio HiFi long-read sequencing capacity, which significantly increases the consensus reads for the full-length isoform output. HIT-scISOseq can generate > 10 million full-length cDNAs with a single Sequel II SMRT Cell 8M, representing eight times the yield of a standard ScISOr-Seq approach. We demonstrated HIT-scISOseq by analyzing transcriptomes from thousands of corneal limbus cells and identified cell-type-specific isoform expression.

HIT-scISOseq design

Droplet-based single-cell RNA sequencing, as performed with the 10x Genomics Chromium system, is usually used as a scalable solution for full-length cDNA library construction in ScISOr-Seq. The 10x Genomics system uses microfluidic partitioning to capture mRNA in single cells and then prepares barcoded, full-length cDNA libraries for the PacBio sequencing platform (Figure S1). The 10x system combines template-switching oligonucleotides (TSOs) and reverse transcription reactions to prepare small-volume cDNA libraries, which can lead to 40–50% of the library being composed of barcode-less TSO artifacts (Fig. 1b, Table 1). Sequencing these artifact reads reduces the throughput of useful CCS reads by approximately 50%. To remove artifacts, the HIT-scISOseq system uses a biotinylated PCR primer that hybridizes to the desired cDNAs; the cDNAs are thus biotinylated during PCR amplification and can be captured using streptavidin beads (Fig. 1a). The capture step significantly reduces the percentage of TSO artifacts, from approximately 50% with the standard ScISOr-Seq method to approximately 8% with HIT-scISOseq (Fig. 1b).

Table 1

Performance of ScISOr-Seq and HIT-scISOseq according to throughput and accuracy in two replicate samples of cynomolgus monkey corneal limbus cells.
Category	Metric	ScISOr-Seq s1	ScISOr-Seq s2	Linked-scISOseq s1	HIT-scISOseq s1	HIT-scISOseq s2
Raw Data	Polymerase reads count (M)	4.95	4.30	5.02	4.74	5.69
	Yield of polymerase reads (GB)	499.77	415.52	365.12	383.74	438.64
	Avg. polymerase reads length (Kb)	101.06	96.66	72.78	80.88	77.06
	Yield of subreads (GB)	487.53	405.99	361.45	379.87	434.44
	Avg. subreads length (Kb)	1.55	1.68	3.64	3.46	3.61
CCS Reads	CCS reads count (M)	4.02	3.38	3.70	3.43	4.23
	Yield of CCS reads (GB)	8.04	7.12	16.56	16.75	21.62
	Avg. CCS reads length (Kb)	2.00	2.11	4.48	4.89	5.11
	Avg. CCS reads passes	70	64	21	23	20
	Avg. CCS reads QV	0.97	0.97	0.95	0.95	0.95
FLNC Detection	Linked cDNA count (M)	3.44	2.90	11.57	11.64	14.84
	FLNC count (M)	1.60	1.29	5.25	10.47	13.23
	NFL count (M)	0.07	0.06	0.14	0.28	0.39
	Artifact RNA count (M)	1.76	1.55	6.18	0.88	1.22
	FLNC percentage (%)	46.59	44.53	45.34	89.99	89.15
	NFL percentage (%)	2.10	2.03	1.24	2.43	2.66
	Artifact RNA percentage (%)	51.31	53.44	53.42	7.58	8.20
FLNC Mapping	Mapped FLNC Count (M)	1.59	1.29	5.20	10.34	13.05
	Mapped FLNC percentage (%)	99.44	99.50	99.05	98.75	98.65
	Avg. FLNC mapping coverage (%)	99.13	99.14	98.88	98.90	98.83
	Avg. FLNC mapping identity (%)	98.45	98.39	97.60	97.74	97.59
	Avg. collapsed FLNC length (Kb)	2.37	2.47	2.18	2.22	2.24
For raw data, the rows show (from top to bottom): (i) total polymerase read count (million) for each sample; (ii) sum of all polymerase read bases (gigabase) for each sample; (iii) average polymerase read length (kilobase) of each sample; (iv) sum of all subread bases (gigabase) in each sample; and (v) average subread length (kilobase) of each sample.
For CCS reads, the rows show (from top to bottom): (i) total CCS read count (million) for each sample; (ii) sum of all CCS read bases (gigabase) in each sample; (iii) average CCS read length (kilobase) of each sample; (iv) average CCS read passes in each sample; (v) average CCS read QV (Phred 33) in each sample.
For FLNC detection, the rows show (from top to bottom): (i) total linked cDNA (defined as linked cDNA in each CCS read) count (million) in each sample; (ii) total full-length non-concatemer (FLNC) read count (million) in each sample; (iii) total non-full length (NFL) read count (million) in each sample; (iv) total artifact cDNA count (million) in each sample; (v) percentage of FLNC in linked cDNAs of each sample; (vi) percentage of NFL in linked cDNAs of each sample; and (vii) percentage of artifact cDNAs in linked cDNAs of each sample.
For FLNC mapping, the rows show (from top to bottom): (i) total mapped FLNC count (million) of each sample; (ii) percentage of mapped FLNC in total FLNC of each sample; (iii) average mapping coverage of mapping FLNC in total FLNC of each sample; (iv) average mapping identity of mapping FLNC in total FLNC of each sample; and (v) average collapsed FLNC reads (defined as the reads after mapping quality filtering and collapsing of redundancy) length (kilobase) in each sample.

Another significant barrier that limits the throughput of CCS read yields is the short insert size of the SMRTbell library in Sequel II systems. The current CCS system allows 10–15 kb SMRTbell libraries to be employed for HiFi long-read sequencing, in which a single DNA polymerase enzyme is affixed to the bottom of a zero-mode waveguide (ZMW) nanoscale well with a single molecule of DNA as a template. For genome sequencing, PacBio recommends a library insert size of 9–15 kb for the Sequel II system. However, the short length of cDNA transcripts (1.5 kb on average) limits the library insert size, which is incompatible with the long-read capacity of ZMW nanoscale wells. Previous studies have used Gibson Assembly or Golden Gate Assembly to ligate target short or mid-sized DNA fragments (ConcatSeq: ~200bp, DeCatCounter: ~870bp) into long SMRTbell libraries for PacBio sequencing [21, 22]; however, these methods show a low throughput, and there are no corresponding reports on full-length cDNA concatenation. Currently, methods for ligating and sequencing full-length isoforms of uneven length at the whole-transcriptome level are still lacking.

Currently, the throughput of ScISOr-Seq is only 20%-30% of that of gDNA sequencing (Table 1). To match the capacity of the ZMW nanoscale well in the PacBio Sequel II system, we hypothesized that a long-insert SMRTbell template could be created by linking multiple cDNA inserts together for downstream Sequel II CCS sequencing. For HIT-scISOseq, a palindromic sequence included at both ends of the primer was designed for the second round of PCR, and the USER enzyme was employed to generate sticky ends (Fig. 1a). Multiple cDNAs were joined using DNA ligase in a head-to-tail fashion. After HIT-scISOseq was posted as a preprint on bioRxiv [23], the application of the USER enzyme was modified in MAS-ISO-seq to generate sticky ends with a sequential array structure and ligate cDNAs into ~ 15 kb sequences, but this required the cDNA of each sample to be divided into 15 tubes for PCR amplification, which increased the number of experimental steps and the complexity [24].

HIT-scISOseq can be used with the widely accessible droplet-based 10x Genomics Chromium system and provides essential isoform information. HIT-scISOseq achieves a marked increase in the number of mapped full-length reads, up to eight times the number produced by the standard ScISOr-Seq method (Fig. 1c, g). The present study demonstrates that this approach is reproducible and readily adaptable to high-throughput single-cell isoform sequencing applications.

Performance of HIT-scISOseq sequencing runs

We sought to compare sequencing read outputs among different library preparation methods using the same PacBio Sequel II instrument and Sequel II SMRT Cell 8M; the evaluated methods were ScISOr-Seq (Figure S1), Linked-scISOseq (Figure S2), and HIT-scISOseq (Fig. 1a & Figure S3). Among these methods, ScISOr-Seq is the standard library preparation method (not involving capture and concatenation procedures). Linked-scISOseq includes only a full-length cDNA concatenation procedure and no TSO artifact removal step. Comparing the three methods allowed the relative contribution of each procedure to be assessed. The performance of the three methods was directly compared using the same limbal epithelium RNA samples, whose transcriptional profiles had previously been well characterized. Two replicates (s1 and s2) were prepared for ScISOr-Seq and HIT-scISOseq, whereas only the s1 sample was used for Linked-scISOseq, so a total of five SMRT Cells were sequenced on the PacBio Sequel II system. The libraries were sequenced following the Iso-Seq sample preparation protocol, with the recommended loading concentrations (Supplementary Table 1).

The computational analysis of concatenated full-length cDNAs requires special attention to the physical proximity of multiple cDNAs and their random 5’-to-3’ orientation. Therefore, an isoform data analysis pipeline (scISA-Tools, see Methods section) was developed to identify and quantify poly(A) tails, cell barcodes (cellBC), and unique molecular identifiers (UMI), and to assign reads to cells and RNA molecules. Based on PacBio’s recommended Iso-Seq data processing procedure, the mapped cDNAs were further classified as full-length non-chimeric (FLNC), non-full-length (NFL), or artifact reads, based on the presence of a poly(A) tail signal and the 5’ and 3’ cDNA primers. Reads with neither the 3’ primer nor the poly(A) tail were referred to as artifact reads.
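The classification rule described above can be illustrated with a minimal Python sketch. This is not the scISA-Tools implementation: the `ReadFeatures` type and `classify_read` function are hypothetical names, and only the artifact rule (no 3’ primer and no poly(A) tail) is stated in the text. The FLNC rule (both primers plus a poly(A) tail) and the treatment of all remaining reads as NFL are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadFeatures:
    """Hypothetical per-read flags produced by primer and poly(A) detection."""
    has_5p_primer: bool
    has_3p_primer: bool
    has_polya: bool


def classify_read(f: ReadFeatures) -> str:
    # Stated in the text: neither 3' primer nor poly(A) tail -> artifact read.
    if not f.has_3p_primer and not f.has_polya:
        return "artifact"
    # Assumption: a full-length read carries both primers and a poly(A) tail.
    if f.has_5p_primer and f.has_3p_primer and f.has_polya:
        return "FLNC"
    # Assumption: everything else is treated as non-full-length.
    return "NFL"


if __name__ == "__main__":
    examples = [
        ReadFeatures(True, True, True),
        ReadFeatures(True, False, False),
        ReadFeatures(False, True, True),
    ]
    for ex in examples:
        print(ex, "->", classify_read(ex))
```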

The performance assessment could be roughly divided into four process elements: raw polymerase reads, CCS reads, FLNC reads, and mapped FLNC reads (Table 1). All three methods yielded similar numbers of raw polymerase reads (ranging from 4.30 M to 5.69 M), while the percentage of productive ZMWs (P1 percentage metric) ranged from 53.75% to 71.13%. The similarity of the polymerase read yields among the three methods suggested that the SMRTbell cDNA templates produced by all three methods were of high quality. Furthermore, the average polymerase read length under all three methods was > 70 kb, suggesting good quality in the instrument runs. PacBio suggests that a minimum of three full passes of a long insert (typically peaking at 4.9 kb to 5.2 kb in the current study, Figure S5) are needed to produce reads with an accuracy greater than 0.9 (the default requirement for CCS sequencing analysis). The average number of full passes was ≥ 20 for both Linked-scISOseq and HIT-scISOseq, indicating high consensus accuracy (> Q20) for both methods (Fig. 1d). This high consensus accuracy allowed us to demultiplex HIT-scISOseq reads based on 10x Genomics cellular barcodes, and > 93% of the HIT-scISOseq FLNC reads could be successfully assigned to individual cells with a CCS QV ≥ 0.95 (Supplementary Table 3). To the best of our knowledge, this is the highest cell barcode demultiplexing rate achieved to date. Notably, the polymerase read lengths obtained via Linked-scISOseq and HIT-scISOseq were only 70–80% of those found under standard ScISOr-Seq (Table 1, Figure S6). This may be because damage to the linked inserts in the SMRTbell libraries hampers the polymerase reaction (Table 1). Therefore, the polymerase read yields generated from Linked-scISOseq and HIT-scISOseq are somewhat lower than those from ScISOr-Seq.

All three methods also generated a similar number of CCS reads, ranging from 3.38 M to 4.23 M. The number of CCS reads was positively associated with the polymerase read count (Table 1). Both Linked-scISOseq and HIT-scISOseq generated longer average CCS read lengths (4.48 kb, Linked-scISOseq; 4.89 kb for the s1 sample and 5.11 kb for the s2 sample, HIT-scISOseq), which were more than double those generated by ScISOr-seq (Table 1, Figure S6). The average CCS read lengths were similar to the average concatenated full-length transcript lengths during library construction (Table 1, Figure S6). All three methods produced similar CCS read QV values (> 0.95), suggesting that high-quality CCS reads could be generated using long insert SMRTbell templates obtained via the ligation of multiple cDNA fragments.

The linked cDNA reads were demultiplexed to generate FLNC reads by using scISA-Tools (see Methods section). Both Linked-scISOseq and HIT-scISOseq generated a larger number of cDNAs than ScISOr-Seq. Notably, HIT-scISOseq produced a much lower number of artifact cDNA reads (7.58%, s1 sample; 8.20%, s2 sample) than Linked-scISOseq (53.42%) and ScISOr-Seq (51.31%, s1 sample; 53.44%, s2 sample). This result indicates that the capture procedure in HIT-scISOseq effectively removes the majority of artifact reads resulting from TSO-flanked fragments during library construction, ultimately increasing the final cDNA read yield. When artifact reads were excluded, the net FLNC read percentages (FLNC/(NFL + FLNC)) obtained were quite similar among the three methods.
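As a worked check of the net FLNC percentage, FLNC/(NFL + FLNC), the short Python sketch below recomputes it from the FLNC and NFL counts reported in Table 1. The function name `net_flnc_percentage` and the dictionary layout are illustrative choices, not part of scISA-Tools.

```python
def net_flnc_percentage(flnc_m: float, nfl_m: float) -> float:
    """Return FLNC / (NFL + FLNC) as a percentage; inputs are counts in millions."""
    return 100.0 * flnc_m / (flnc_m + nfl_m)


# (FLNC count, NFL count) in millions, copied from Table 1.
table1_counts = {
    "ScISOr-Seq s1": (1.60, 0.07),
    "ScISOr-Seq s2": (1.29, 0.06),
    "Linked-scISOseq s1": (5.25, 0.14),
    "HIT-scISOseq s1": (10.47, 0.28),
    "HIT-scISOseq s2": (13.23, 0.39),
}

net_percentages = {
    sample: net_flnc_percentage(flnc, nfl)
    for sample, (flnc, nfl) in table1_counts.items()
}

if __name__ == "__main__":
    for sample, pct in net_percentages.items():
        print(f"{sample}: {pct:.1f}% net FLNC")
```

With the Table 1 values, every sample falls between roughly 95% and 98%, consistent with the statement that the net FLNC percentages are similar across methods.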

HIT-scISOseq generated a greater number of mapped FLNC reads than ScISOr-Seq and Linked-scISOseq did. After aligning the PacBio reads against the monkey reference genome, HIT-scISOseq produced the greatest number of mapped reads (10.34 M, s1 sample; 13.05 M, s2 sample), representing approximately 6x and 10x as many reads per SMRT Cell as the ScISOr-Seq results for s1 and s2, respectively, and approximately 2x as many reads as Linked-scISOseq. The above-mentioned capture and concatenation procedures increased the mapped FLNC reads by factors of two and four, respectively, with a combined 8-fold increase in yield. The number of genes and UMIs detected per cell with HIT-scISOseq was markedly higher than with ScISOr-Seq (Fig. 1g). Despite the observed differences in read yield, the three methods showed similar mappability of FLNC reads. More than 98% and 99% of the FLNC reads from HIT-scISOseq and standard ScISOr-Seq, respectively, were mappable (Table 1). These mappability results again confirmed the high quality of the FLNC reads. The average mapping coverage of FLNC reads was > 98%, and the average mapping identity values were > 97% for both Linked-scISOseq and HIT-scISOseq, which are comparable to the values generated by ScISOr-Seq (Table 1). These mapping metrics confirm the robustness of the scISA-Tools pipeline for precise read alignment.
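The fold changes quoted above follow directly from the mapped FLNC counts in Table 1. The following Python sketch reproduces the arithmetic; the variable and function names are illustrative only.

```python
# Mapped FLNC counts (millions) copied from Table 1.
mapped_flnc = {
    "ScISOr-Seq": {"s1": 1.59, "s2": 1.29},
    "Linked-scISOseq": {"s1": 5.20},
    "HIT-scISOseq": {"s1": 10.34, "s2": 13.05},
}


def fold_change(method: str, baseline: str, sample: str) -> float:
    """Ratio of mapped FLNC reads for `method` over `baseline` in one sample."""
    return mapped_flnc[method][sample] / mapped_flnc[baseline][sample]


hit_vs_scisor_s1 = fold_change("HIT-scISOseq", "ScISOr-Seq", "s1")
hit_vs_scisor_s2 = fold_change("HIT-scISOseq", "ScISOr-Seq", "s2")
hit_vs_linked_s1 = fold_change("HIT-scISOseq", "Linked-scISOseq", "s1")

if __name__ == "__main__":
    print(f"HIT vs ScISOr-Seq, s1: {hit_vs_scisor_s1:.1f}x")
    print(f"HIT vs ScISOr-Seq, s2: {hit_vs_scisor_s2:.1f}x")
    print(f"HIT vs Linked-scISOseq, s1: {hit_vs_linked_s1:.1f}x")
```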

Although the median FLNC length of HIT-scISOseq was shorter than that of ScISOr-Seq, HIT-scISOseq could still cover the range of FLNC lengths obtained via ScISOr-Seq (Fig. 1e); in addition, the average lengths of collapsed reads (transcripts) obtained from HIT-scISOseq were comparable to those from ScISOr-Seq (Table 1, Figure S6). Although HIT-scISOseq favors shorter cDNAs, this does not skew the gene expression profile compared to ScISOr-Seq (Figure S9). Additionally, the two biological replicate data sets generated via HIT-scISOseq were compared, and the results for the two replicates were very similar. The subtle differences in read yield metrics between biological replicates may be due to differences in the percentage of productive ZMW loading and in sample quality. In addition, by extending the reaction time of the USER enzyme and T4 DNA ligase, we were able to ligate cDNAs into longer concatemers. Combined with the latest PacBio polymerase binding kit (which is suitable for libraries above 3 kb), HIT-scISOseq was able to obtain up to 30 M FLNC reads (Supplementary Table 2, Figure S7). These results indicate that HIT-scISOseq still has room for optimization and enhancement.

HIT-scISOseq assigns cell barcodes with high accuracy

The accurate demultiplexing of HIT-scISOseq concatenated reads into single-cell full-length isoforms is a central factor in determining the accuracy of cell barcodes. The use of palindromic end adapter sequences to ligate a variable number of cDNAs in HIT-scISOseq makes the segmentation of concatenated reads difficult. By enumerating the possible forms of ligation between two cDNAs (Fig. 2a), we found that the correct segmentation of the FLNC depends on the combination of adapters in two cases (5p + 3p- and 3p + 5p-). Accordingly, although HIT-scISOseq lacks a sequential array structure similar to that of MAS-ISO-seq, scISA-Tools can still segment concatenated reads accurately.

To evaluate the accuracy and efficiency of concatenated read demultiplexing and cell-barcode assignment, we amplified the SIRV Set4 synthetic RNA isoforms with PCR primers labeled with the barcode “AAGTCCTTCCAGTCTT + 12N”, which was 1 edit distance from the most similar 10x whitelist barcode. After double-strand cDNA synthesis, we added 0.1 ng of barcoded SIRV cDNA to 99 ng of cDNA from a 10x Genomics human-mouse cell line mixture to serve as a known cell for HIT-scISOseq library preparation. After demultiplexing the HIT-scISOseq concatenated reads, we used the mapped FLNC reads from the SIRV and the human-mouse mixture to calculate the corresponding TP, FP, TN, and FN values (Fig. 2d) for barcode detection, from which we derived the accuracy and specificity of the barcode detection. As shown in Fig. 2b-c, scISA-Tools achieved 99.997% and 99.998% barcode assignment accuracy and specificity, respectively (FLNC QV cutoff ≥ 0.95, which means that 1 mismatch was allowed in a 16 bp barcode). The mixed human-mouse data further confirmed that demultiplexing and barcode assignment were accurate (Fig. 2d-h).
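The two ideas in this evaluation, whitelist matching that tolerates one mismatch in a 16 bp barcode and accuracy/specificity computed from TP, FP, TN, and FN, can be sketched in Python as follows. This is an illustrative simplification and not the scISA-Tools algorithm: it uses Hamming distance (substitutions only) rather than full edit distance, the toy whitelist is hypothetical (the second entry is made up), and ambiguous matches are simply discarded.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed: str, whitelist: Iterable[str],
                   max_mismatch: int = 1) -> Optional[str]:
    """Return the unique whitelist barcode within `max_mismatch`, else None."""
    hits = [bc for bc in whitelist if hamming(observed, bc) <= max_mismatch]
    # Assumption: reads matching zero or several barcodes are left unassigned.
    return hits[0] if len(hits) == 1 else None


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """(TP + TN) / all calls."""
    return (tp + tn) / (tp + fp + tn + fn)


def specificity(fp: int, tn: int) -> float:
    """TN / (TN + FP)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Toy whitelist: the SIRV spike-in barcode from the text plus a made-up one.
    toy_whitelist = ["AAGTCCTTCCAGTCTT", "CCGTAAGGTTCAGTCA"]
    print(assign_barcode("AAGTCCTTCCAGTCTA", toy_whitelist))  # one mismatch
    print(assign_barcode("GGGGGGGGGGGGGGGG", toy_whitelist))  # no match
    print(accuracy(tp=90, fp=0, tn=10, fn=0))
```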

HIT-scISOseq gene expression clustering of corneal limbus single cells into cell types

Single-cell RNA sequencing has been widely used to quantify gene expression and to identify distinct cell types. To validate the ability of HIT-scISOseq to distinguish different cell types, HIT-scISOseq and Illumina short-read RNA sequencing (NGS) were compared using the same single-cell 10x Genomics limbal epithelium cDNA samples, which consisted of several well-defined cell types, and a strong concordance was identified between the two platforms. Gene expression based on HIT-scISOseq data was quantified using the scISA-Tools pipeline. There was a strong correlation of the UMI counts by cellBC (Pearson's r > 0.990, p < 0.001, Figs. 3a & S10a) and the UMI counts by gene (Pearson's r > 0.950, p < 0.001, Figs. 3b & S10b) between the HIT-scISOseq and NGS platforms. There was also a high concordance in UMI counts by gene in the HIT-scISOseq data generated from the two biological replicates (Pearson's r = 0.998, p < 0.001, Fig. 3c). Moreover, UMAP projection of gene expression data from the two platforms showed consistent results in terms of cell-type classification (four cell clusters, Figs. 3d-e & S10c-d, Supplementary Tables 6–7) with clear cell type boundaries, including conjunctival cells, limbal stem cells, central basal cells, and differentiated cells. The barcoding consistency of the top-ranked 2000 cells between NGS and HIT-scISOseq was 99% (Figure S8). The gene expression values obtained for the same cell type showed a high correlation (Pearson's r > 0.95, Figs. 3g & S10f) between NGS and HIT-scISOseq, with the percentage of shared cell barcodes for the same cell type being > 99% (Figs. 3f & S10f, Supplementary Table 6). The high concordance of cellBC counts suggests that HIT-scISOseq can accurately identify the transcriptomes of cells isolated with the 10x Genomics system. We then created heatmaps of the top 15 marker genes for each cell cluster (Figs. 3h-i & S10g-h). The expression of marker genes also showed similar patterns between the two platforms. These results confirm that the gene expression data derived from HIT-scISOseq are comparable to those derived from the NGS platform.
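For readers who want to reproduce this kind of platform comparison, the sketch below computes Pearson's r between per-cell UMI counts from two platforms using only the Python standard library. The `pearson_r` function is a plain textbook implementation, and the UMI counts in the example are invented toy values, not data from this study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy UMI counts per cell barcode (hypothetical values for illustration).
    umi_long_read = [1200, 850, 2300, 400, 1750]
    umi_short_read = [5100, 3600, 9800, 1500, 7300]
    print(f"Pearson's r = {pearson_r(umi_long_read, umi_short_read):.3f}")
```

In practice the two vectors would be aligned on shared cell barcodes (or shared genes) before computing the correlation.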

HIT-scISOseq captures single-cell isoform expression in corneal limbus

To verify that HIT-scISOseq can accurately quantify isoform expression, we first used SIRV to confirm that isoform detection is not confounded by isoform sequence similarity. We performed isoform identification confusion matrix calculations using HIT-scISOseq SIRV isoform data, which showed confusion rates as low as 0.1066% (Fig. 4b). We then evaluated the isoform quantification results by comparing the observed values obtained via HIT-scISOseq with known ERCC isoform abundance data. The abundance measured by HIT-scISOseq was highly consistent with the ground truth, with a correlation coefficient of 0.97 (Fig. 4a).

Next, we hypothesized that it would be possible to identify and quantify single-cell isoforms using the HIT-scISOseq data set. After SQANTI3 quality control and artifact filtering of the corneal limbus data, we retained only four main types of isoforms according to SQANTI3 classification: FSM, ISM, NIC, and NNC. Finally, we retained 29392 and 31793 isoforms from the s1 and s2 samples, respectively (Supplementary Table 5). Figure 4c shows that at the single-cell level, FSM was the most abundant isoform type in both samples, and there were a considerable number of NNC isoforms, indicating that our data may be used to improve the reference annotation.

Based on isoform-level expression, we identified the same cell clustering pattern as that obtained by gene-level expression clustering analysis (Fig. 4d). In addition, isoform-level expression was strongly correlated between the two biological replicate samples (Figure S11). The top 15 marker isoforms for each cell cluster were further analyzed, and some of these isoforms were found to be previously unidentified (Fig. 4e). This result suggests that HIT-scISOseq can reveal more complete isoforms in single cells. We further selected 2 marker isoforms associated with marker genes in each cell type for expression pattern verification. The dot plot and feature plot showed that these marker isoforms presented highly cell-type-specific expression (Fig. 4f-g), supporting the capability of HIT-scISOseq to capture single-cell isoform expression in the corneal limbus.

To validate the cell-type-specific isoforms detected by HIT-scISOseq with a sequencing-free method (qPCR), we chose corneal basal cells (denoted as B) and conjunctival cells (denoted as Cj) as validation samples because they can be sampled from two separate regions of the ocular surface (Figure S12a). We selected four cell-type-specific isoforms in each of B and Cj for qPCR validation; their corresponding genes were expressed in both clusters (Figure S12b-c). The qPCR results showed expression patterns consistent with the HIT-scISOseq results (Figure S12e).

HIT-scISOseq revealed cell-type-specific isoform expression changes in the corneal limbus

Next, we mined isoform-driven expression changes between different cell types. To rule out the effect of gene expression, we first applied the ‘FindAllMarkers’ function in the Seurat R package to the gene and isoform expression matrices. Then, cell-type-specific differentially expressed isoforms were selected (Supplementary Table 9) under the following condition: the avg_log2FC value of the up/down-regulated isoform (p-value < 0.01) must be at least 2-fold higher or lower than the avg_log2FC (p-value < 0.01) of its associated gene. We finally obtained 158 isoforms with cell-type-specific differential expression (Fig. 5b, Supplementary Tables 8–9), which represented 147 genes (Fig. 5a, Supplementary Table 10). These genes were enriched in pathways such as cell adhesion regulation, tissue morphogenesis, epithelial cell differentiation, epithelial cell proliferation, and skin development, which are highly related to corneal epithelial cell proliferation and differentiation (Fig. 5c).
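The selection condition can be read in more than one way, so the Python sketch below implements one explicit interpretation rather than the authors' exact filter. It assumes that "2-fold" refers to the linear scale, i.e. that the isoform and gene avg_log2FC values differ by at least 1 log2 unit, and that both the isoform and its gene must pass p < 0.01. The `Marker` type, the `is_isoform_driven` function, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    """One row of a marker table (e.g. as exported from a FindAllMarkers run)."""
    name: str
    avg_log2fc: float
    p_value: float


def is_isoform_driven(isoform: Marker, gene: Marker,
                      p_cutoff: float = 0.01,
                      min_log2_gap: float = 1.0) -> bool:
    # Assumption: both the isoform and its gene must be significant.
    if isoform.p_value >= p_cutoff or gene.p_value >= p_cutoff:
        return False
    # Assumption: "2-fold higher or lower" means a gap of >= 1 in log2 units.
    return abs(isoform.avg_log2fc - gene.avg_log2fc) >= min_log2_gap


if __name__ == "__main__":
    gene = Marker("GENE_X", avg_log2fc=0.4, p_value=1e-4)          # toy values
    iso_changed = Marker("GENE_X-iso2", avg_log2fc=1.9, p_value=1e-5)
    iso_similar = Marker("GENE_X-iso1", avg_log2fc=0.6, p_value=1e-6)
    print(is_isoform_driven(iso_changed, gene))
    print(is_isoform_driven(iso_similar, gene))
```

A ratio-based reading (isoform avg_log2FC at least twice, or at most half, the gene avg_log2FC) would require a different comparison, so the threshold logic should be adapted to whichever definition is intended.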

We further selected 4 cell-type-specific differentially expressed isoforms associated with 4 genes (ITM2B, DUSP1, B2M, and HOPX) whose expression patterns differed significantly between cell types (Fig. 5e). Figure 5e shows that the expression of these genes is driven by the major isoforms, and the expression pattern of the changed isoforms is inconsistent with the expression pattern of their corresponding genes and major isoforms (Fig. 5e, f-h). The exon structures of these expression-changed isoforms differ significantly from the reference annotation and the major isoform (Fig. 5d), possibly indicating that these isoforms play different functional roles than the major isoform. Previous studies have shown that these 4 genes are associated with neurodegenerative diseases [25], tumor resistance [26], hypercatabolic hypoproteinemia [27], and cardiac development [28], but they have rarely been reported in the corneal limbus. These findings have not previously been verified independently by low-throughput isoform sequencing or NGS-based scRNAseq.

This study demonstrates that HIT-scISOseq is a high-throughput, highly accurate method that can be used to characterize isoforms in thousands of single cells. The PacBio Sequel II SMRT Cell 8M has allowed long insert reads (15 kb) to be used with high consensus accuracy (> 99.9% for HiFi reads). This study shows that the concatenation of multiple cDNAs into a long library can bridge the gap between short libraries and PacBio’s HiFi long-read sequencing capacity. Our experiments show that HIT-scISOseq can also ligate cDNAs into sequences of 15 kb or longer; however, the risk of irreversible cDNA damage (e.g., nicks) increases with the length of the tandem cDNAs in the current HIT-scISOseq system, which can severely impair the performance of DNA polymerase in ZMWs. Therefore, the current version of the system sets ~ 5 kb (obtained by ligating 3–4 full-length transcripts, Fig. 1f) as the appropriate length limit for preparing long-insert libraries at the PCR step after cDNA ligation, because this moderate level of concatenation does not lead to a high percentage of nicks. In the future, it will be beneficial to explore methods of reducing DNA nicks, which will enable the construction of high-quality longer-insert libraries with more concatenated cDNAs, thus further improving the throughput of HIT-scISOseq. Moreover, using the BluePippin system to enrich for longer concatenated molecules generated via our method may be an alternative approach for increasing long-read yield.

Although this study focused on ligating mRNA transcripts, HIT-scISOseq could be extended into a universal DNA linking protocol: any sequence of interest (e.g., lncRNAs, mRNAs with full-length poly(A) tails, circular RNAs, 16S rRNA, and exon-targeted DNA) can be targeted and enriched by altering the composition of the concatenation libraries [17, 18]. Additionally, the high quality of full-length PacBio transcripts allows HIT-scISOseq to identify both transcriptional information and somatic mutations at the single-cell level, to reveal more detailed phasing of transcripts, and to support allele-specific expression (ASE) analysis [29]. HIT-scISOseq can also be used in the multiplexed single-cell RNA sequencing of pooled unrelated individuals, in which natural polymorphisms in long transcripts can be utilized to demultiplex reads and recover sample identity [6]. Furthermore, although the present study demonstrated only that HIT-scISOseq is fully compatible with a commercially available single-cell platform (10x Genomics), it should be readily adaptable to other microwell-based and combinatorial-indexing-based technologies.

Monkey limbal sampling experiment

All animal experiments were conducted following the ARVO Statement for the Use of Animals in Ophthalmic and Vision Research and were approved by the Institutional Animal Care and Use Committee of Zhongshan Ophthalmic Center, Sun Yat-sen University. Cynomolgus monkeys (Macaca fascicularis) were anesthetized using a mixture of ketamine and xylazine, and topical anesthesia consisted of 0.5% proparacaine hydrochloride (Alcaine; Alcon). Only female monkeys aged four years were used. Limbal excision was performed on the right eye, and the left eye was left undamaged. Limbal excision was conducted by lamellar dissection of the limbal zone, 2 mm into the cornea, 2 mm into the conjunctiva, and 100 µm in depth. Biopsy tissues were transferred to cryovials containing Advanced DMEM F-12 and were placed on ice.

Single-cell Dissociation

Dissected limbal tissue was micro-dissected and disaggregated into single cells using Dispase II (Sigma) and collagenase IV (Sigma) at 37°C under constant rotation. The epithelial layer was isolated from the underlying stroma and was separately digested at 37°C for 2 h using 2 mL of 1 mg mL⁻¹ collagenase A (Sigma-Aldrich Corp., St. Louis, MO, USA) in Dulbecco's modified Eagle's medium (DMEM) containing 10% FBS, 50 µg mL⁻¹ gentamicin, and 1.25 µg mL⁻¹ amphotericin B. The clusters were further digested with 0.25% trypsin and 1 mM EDTA, with gentle pipetting to yield single cells. The cells were filtered through a 30-µm cell strainer and were re-suspended in 60 µL PBS containing 0.04% BSA to obtain a concentration of 1,000 cells µL⁻¹ for capture on the 10x Genomics Chromium controller.

10x Genomics Single-cell Capture And Illumina Library Preparation

The dissociated single cells were processed on the GemCode Single Cell Platform per the manufacturer’s recommendations using the Chromium Single Cell 3’ GEM, Library, and Gel Bead Kit v3 (10x Genomics; PN-1000075) with a recovered quantity of approximately 2,000 cells. Illumina library preparation was performed using the Chromium Single Cell 3’ Reagent Kits User Guide (V3 Chemistry). After the cDNA cleanup step (Step 2.1), half of the purified cDNA was used for PacBio library preparation, and the rest was used for downstream Illumina library preparation. Illumina libraries were sequenced on a NextSeq 550 system (SY-415-1002, Illumina) by using NextSeq High Output Kits (150 cycles; 20024907, Illumina) with the following read protocol: read 1, 118 cycles; i7 index read, 8 cycles; read 2, 40 cycles.

cDNA Amplification And Capture For PacBio Library Construction

Eighty nanograms of cDNA product was amplified for five PCR cycles using KAPA HiFi HotStart Uracil 2x ReadyMix (Kapa Biosystems) and newly designed PCR primers containing deoxyuracil, one of which was biotinylated.

Forward primer: 5’-Biotin-ACTAGUCTACACGACGCTCTTCCGATCT-3’

Reverse primer: 5’-ACTAGUAAGCAGTGGTATCAACGCAGAG-3’

The PCR products were then purified using 0.8 volumes of Agencourt AMPure XP Beads (Beckman Coulter), quantified using Qubit dsDNA HS Assay Kits (Thermo Fisher), and assessed via Agilent 2100 DNA HS Assays (Figure S3). The barcode-UMI-poly(dT)-flanked cDNAs were captured on streptavidin-coated M-280 Dynabeads using Dynabeads™ kilobaseBINDER™ Kits (60101, Invitrogen, Carlsbad, CA), and the unbound cDNAs were removed.

USER Cloning-based Ligation Of Multiple Inserts

Complementary DNA products on the Dynabeads were washed with wash buffer and nuclease-free water before being re-suspended in 19 µL reaction buffer containing 2 µL 10x T4 DNA ligase buffer (NEB) and 1 µL USER Enzyme (NEB). The products were then incubated at 37°C for 20 min to nick the deoxyuracil site, which generated 3’ palindromic overhangs suitable for the ligation of multiple inserts and simultaneously released the cDNA from the M-280 Dynabeads. The reaction tube was placed in a magnetic stand, and the supernatant was transferred to a new tube. One microliter of T4 DNA ligase (NEB, 400,000 U mL⁻¹) was added to the reaction mixture, and the resulting mixture was incubated at 16°C for 10 min to ligate the inserts. The resultant multi-insert library was purified using 0.4 volume of Agencourt AMPure XP Beads (Beckman Coulter) and was then end-repaired and A-tailed using the NEBNext Ultra II End Repair/dA-Tailing Module, with incubation for 15 min at 20°C and then for 30 min at 65°C. The cDNA was ligated with 2 µL of a dT-overhang selection adapter (10 µM, annealed from the oligonucleotides 5’-GAACGACATGGCTACGATCCGACTT-3’ and 5’-PHO-AGTCGGATCGTAGCCATGTCGTTC-3’) by using the NEBNext® Ultra™ II Ligation Module (NEB) for 15 min at 20°C, before being purified with 0.4 volume of Agencourt AMPure XP Beads (Beckman Coulter). Then, 100 ng of the purified products was PCR-amplified for 8–9 cycles using KAPA HiFi HotStart 2x ReadyMix and a selection primer (5’-PHO-GAACGACATGGCTACGATCCGACTT-3’) to select multi-insert molecules free of ligation nicks. The amplified products were again purified using 0.4 volume of Agencourt AMPure XP Beads (Beckman Coulter) and were assayed using Agilent DNA 12000 Assays.
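The palindromic character of the overhangs can be checked directly from the two primer sequences listed above. The short Python sketch below is only an illustration: it takes the 5’ bases of each primer up to and including the deoxyuracil (written as T) and tests whether that hexamer equals its own reverse complement. The helper names `reverse_complement` and `user_site` are hypothetical and are not part of scISA-Tools.

```python
# Primer sequences as given in the text (U marks the deoxyuracil; biotin omitted).
FORWARD_PRIMER = "ACTAGUCTACACGACGCTCTTCCGATCT"
REVERSE_PRIMER = "ACTAGUAAGCAGTGGTATCAACGCAGAG"

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def user_site(primer):
    """Return the 5' bases up to and including the dU, with dU written as T."""
    cut = primer.index("U") + 1
    return primer[:cut].replace("U", "T")


site_fwd = user_site(FORWARD_PRIMER)
site_rev = user_site(REVERSE_PRIMER)

# Both primers carry the same hexamer, and it is its own reverse complement.
is_palindromic = site_fwd == site_rev == reverse_complement(site_fwd)
print(site_fwd, site_rev, is_palindromic)
```

Because both ends of every amplified cDNA carry the same self-complementary site, any two ends should be able to anneal to each other, which is consistent with the description of inserts being ligated into tandem arrays.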

PacBio SMRTbell Template Preparation And Sequencing

Amplified PCR products were end-repaired and A-tailed using the NEBNext End Repair/dA-Tailing Module, ligated with a dT-overhang hairpin adapter using the NEBNext® Ultra™ II Ligation Module (NEB), and purified with 0.4 volume of Agencourt AMPure XP Beads (Beckman Coulter) to produce the SMRTbell Template. To remove residual adapters and unligated DNA fragments, 1 µL exonuclease I (NEB), 1 µL exonuclease III (NEB), and NEBuffer 1 (NEB) were added to the library before incubation at 37°C for 1 h. The products were purified using 0.8 volume of Agencourt AMPure XP beads, eluted with 15 µL elution buffer (10 mM Tris-HCl, pH 8.0), and quantified using Agilent DNA 12000 Kits (Agilent). Sequencing primer annealing and polymerase binding to the PacBio SMRTbell Templates were performed according to the manufacturer’s recommendations (PacBio, US). The library complex was then sequenced using SMRT Cell 8M (PacBio), which was compatible with the Sequel II sequencer.

HIT-scISOseq Data Processing Pipeline

Because HIT-scISOseq links multiple transcripts together, multiple cDNA library-prep primer sequences can be found in one CCS read, and the official PacBio IsoSeq3 pipeline would therefore classify HIT-scISOseq reads as “chimeric”. For this reason, IsoSeq3 was not considered suitable for our analysis. Instead, a set of analysis tools (https://github.com/shizhuoxing/scISA-Tools) was developed as a pipeline for processing 10x Genomics ScISOr-Seq reads. The pipeline includes quality control, basic statistics, full-length transcript identification, cell barcode and UMI extraction and correction, isoform clustering, single-cell isoform quantification, and single-cell expression matrix format transformation. It is useful not only for HIT-scISOseq data but also for 10x Genomics data generated with the standard ScISOr-Seq protocol.

API For Interactive Loupe Browser Visualization

Loupe Browser is an established desktop application that allows the interactive visualization of single-cell RNA data from the 10x Genomics platform. scMatrix2CellRangerH5, a utility developed in the present study, converts a text matrix to an HDF5 format compatible with the CellRanger reanalyze pipeline. This allows “cloupe” files to be generated and visualized in Loupe Browser.
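As a rough illustration of the kind of conversion involved, the sketch below turns a dense text matrix into compressed sparse column arrays (`data`, `indices`, `indptr`, `shape`), which is the general layout used by CellRanger-style HDF5 matrices. It is not the scMatrix2CellRangerH5 implementation: the tab-delimited input layout (barcodes in the header row, one feature per following row) and the function name `dense_text_to_csc` are assumptions made for this example, and writing the arrays into an actual HDF5 file (for instance with an HDF5 library) is left out.

```python
def dense_text_to_csc(text):
    """Convert a tab-delimited feature-by-barcode matrix to CSC arrays.

    Assumed input layout (hypothetical): first line is a header whose first
    field is a label and whose remaining fields are cell barcodes; every other
    line is a feature ID followed by one integer count per barcode.
    """
    lines = [line for line in text.strip().splitlines() if line]
    barcodes = lines[0].split("\t")[1:]
    features, rows = [], []
    for line in lines[1:]:
        fields = line.split("\t")
        features.append(fields[0])
        rows.append([int(value) for value in fields[1:]])

    # Column-major traversal: one column per barcode, rows are features.
    data, indices, indptr = [], [], [0]
    for col in range(len(barcodes)):
        for row, values in enumerate(rows):
            if values[col] != 0:
                data.append(values[col])
                indices.append(row)
        indptr.append(len(data))

    return {
        "barcodes": barcodes,
        "features": features,
        "data": data,
        "indices": indices,
        "indptr": indptr,
        "shape": (len(features), len(barcodes)),
    }


example = "gene\tc1\tc2\nG1\t0\t2\nG2\t3\t0\n"
print(dense_text_to_csc(example))
```

Storing only the non-zero entries keeps the converted matrix compact, which matters because single-cell count matrices are mostly zeros.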

Single-cell Short-read Data Analysis

For each sample, the 10x Genomics CellRanger pipeline (version 3.1.0) was used to obtain a single-cell expression matrix based on the Macaca fascicularis genome and transcriptome (Ensembl Macaca_fascicularis_5.0.99).

Single-cell Isoform Sequencing And Bioinformatics Pipeline

Generation of circular consensus sequencing reads

Using SMRT-Link (version 8.0.0.80529), CCS reads were generated with the following modified parameters: “--min-passes 0 --min-length 50 --max-length 21000 --min-rq 0.75”.

Generation of single-cell full-length non-concatemer (FLNC) reads

First, the 5' and 3' primers were mapped to CCS reads using NCBI BLAST (version 2.10.0+)[30, 31] with the following parameters: “-outfmt 7 -word_size 5”. Then, the primer BLAST results were used as inputs, and the classify_by_primer utility was employed to extract cell barcodes and UMIs. Finally, FLNC reads were generated with the following parameters: “-min_primerlen 16 -min_seqlen 50”. The functions of the classify_by_primer utility are briefly listed as follows: (1) parsing each standard pair of 5’ and 3' primers in CCS reads to obtain full-length isoforms, which were then oriented from the 5’ to the 3’ end; (2) trimming the 5’ and 3' primer sequences and extracting the 28 bp sequence adjacent to the 3’ primer as the cell barcode and UMI; and (3) trimming the 3’ polyA tail using a sliding window algorithm. Because the program strictly required 5’ and 3’ primers to be paired one after another, each full-length read could be oriented. Reads with primers, cell BCs, UMIs, and polyA tails were considered FLNC reads.
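Steps (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration and not the classify_by_primer code: it assumes the 28 bp tag has already been oriented barcode-first, splits it as 16 bp cell barcode plus 12 bp UMI (the 10x Genomics 3’ v3 layout), and uses an illustrative window size and adenine fraction for the sliding-window polyA trimming. The function names and the `window` and `min_a_fraction` parameters are hypothetical.

```python
def split_barcode_umi(tag, barcode_len=16, umi_len=12):
    """Split an oriented 28 bp tag into (cell barcode, UMI)."""
    if len(tag) != barcode_len + umi_len:
        raise ValueError("tag length does not match barcode_len + umi_len")
    return tag[:barcode_len], tag[barcode_len:]


def trim_polya(seq, window=10, min_a_fraction=0.8):
    """Trim a 3' polyA tail with a sliding window.

    A window is moved from the 3' end toward the 5' end for as long as it
    stays A-rich; this tolerates occasional non-A bases inside the tail.
    Returns (trimmed sequence, number of bases removed).
    """
    end = len(seq)
    while end >= window:
        a_count = seq[end - window:end].count("A")
        if a_count / window < min_a_fraction:
            break
        end -= 1
    if end == len(seq):
        # The 3'-most window was not A-rich: no tail detected.
        return seq, 0
    # Remove the A's left over from the last A-rich window.
    trimmed = seq[:end].rstrip("A")
    return trimmed, len(seq) - len(trimmed)


barcode, umi = split_barcode_umi("ACGTACGTACGTACGT" + "TTGCATTGCATT")
body, tail_len = trim_polya("GATTACAGGC" + "A" * 20)
print(barcode, umi, body, tail_len)
```

A window-based rule is used instead of simply stripping trailing A's so that a single sequencing error inside the tail does not stop the trimming early.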

Genome alignment of FLNC reads

After FLNC detection and trimming, the primers, cell BCs, UMIs, and polyA tails had been identified and removed. The remaining portion of each FLNC read was aligned to the Macaca fascicularis genome (Ensembl Macaca_fascicularis_5.0.99) by using minimap2 (version 2.17-r974-dirty)[32] in spliced alignment mode with the following parameters: “-ax splice -uf --secondary=no -C5”.

Cell barcode and UMI correction

A strategy similar to that employed by 10x Genomics CellRanger was adopted. The cellBC correction function in CellRanger was wrapped as a module in the pipeline, named cellBC_UMI_corrector. This utility could handle long-read data independently, without the need for short-read information as a guide.

For cellBC correction, CellRanger relies on the set of known barcodes for a given assay chemistry, which is stored in an “allowlist” file. The steps are briefly described as follows:

1. The observed frequency of every barcode on the “allowlist” in the data set was counted.

2. For every observed barcode that was not on the “allowlist” but was within a Hamming distance of 1 (one substitution) of an “allowlist” barcode, the posterior probability that the observed barcode originated from that “allowlist” barcode with a sequencing error at the differing base (based on the base Q score) was computed. The observed barcode was then replaced by the “allowlist” barcode with the highest posterior probability, provided that this probability exceeded 0.975.
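The barcode correction steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the CellRanger or cellBC_UMI_corrector code: the prior for each “allowlist” barcode is taken as its observed frequency with a pseudocount of 1, the error probability at the differing base is derived from its Phred score as 10^(-Q/10), and the function name `correct_barcode` and its arguments are hypothetical.

```python
def correct_barcode(observed, quals, allowlist_counts, threshold=0.975):
    """Correct one observed barcode against an allowlist.

    observed         -- barcode sequence read from the data
    quals            -- Phred quality score of each barcode base
    allowlist_counts -- dict: allowlist barcode -> observed frequency
    Returns the corrected barcode, or None if no candidate is confident enough.
    """
    if observed in allowlist_counts:
        return observed  # already a valid barcode

    total = sum(allowlist_counts.values())
    n_barcodes = len(allowlist_counts)
    likelihoods = {}
    for pos, base in enumerate(observed):
        # Probability that this base is a sequencing error.
        p_error = min(1.0, 10 ** (-quals[pos] / 10))
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = observed[:pos] + alt + observed[pos + 1:]
            if candidate in allowlist_counts:
                # Prior from observed frequency (pseudocount is an assumption).
                prior = (allowlist_counts[candidate] + 1) / (total + n_barcodes)
                likelihoods[candidate] = prior * p_error

    if not likelihoods:
        return None
    norm = sum(likelihoods.values())
    best = max(likelihoods, key=likelihoods.get)
    return best if likelihoods[best] / norm > threshold else None


counts = {"AAAA": 100, "AATT": 100}
print(correct_barcode("AAAT", [30, 30, 30, 5], counts))   # low Q at last base
print(correct_barcode("AAAT", [30, 30, 30, 30], counts))  # ambiguous
```

In the first call the low-quality last base makes “AAAA” by far the most probable origin, so the barcode is corrected; in the second call both candidates are equally likely, the posterior stays below 0.975, and the read is left uncorrected.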

The steps taken for UMI correction are briefly described as follows:

1. Basic quality filtering and correction for UMI sequencing errors with the following restrictions:

a) Must not be a homopolymer, e.g. AAAAAAAAAA;

b) Must not contain N;

c) Must not contain bases with base quality < 10.

2. UMIs located within a Hamming distance of 1 (one substitution) of a higher-count UMI were corrected to the higher-count UMI if both UMIs shared the same cell barcode and gene.
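The two UMI steps can be illustrated with the sketch below. It is a minimal example under assumptions rather than the pipeline's implementation: reads are represented as hypothetical `(cell_barcode, gene, umi)` tuples, and when several higher-count neighbours exist the one with the highest count is chosen.

```python
from collections import Counter, defaultdict


def umi_passes_filter(umi, quals, min_quality=10):
    """Apply the three basic UMI quality rules listed above."""
    if len(set(umi)) == 1:       # homopolymer, e.g. AAAAAAAAAA
        return False
    if "N" in umi:               # contains an ambiguous base
        return False
    if min(quals) < min_quality:  # contains a low-quality base
        return False
    return True


def is_hamming1(a, b):
    """True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def correct_umis(records):
    """Correct UMIs within each (cell barcode, gene) group.

    records -- list of (cell_barcode, gene, umi) tuples
    Returns a list of the same length with corrected UMIs.
    """
    groups = defaultdict(Counter)
    for cell, gene, umi in records:
        groups[(cell, gene)][umi] += 1

    corrected = []
    for cell, gene, umi in records:
        counts = groups[(cell, gene)]
        neighbours = [
            other for other in counts
            if is_hamming1(umi, other) and counts[other] > counts[umi]
        ]
        if neighbours:
            # Pick the highest-count neighbour (ties broken alphabetically).
            umi = max(neighbours, key=lambda other: (counts[other], other))
        corrected.append((cell, gene, umi))
    return corrected


reads = [("c1", "g1", "ACGT")] * 3 + [("c1", "g1", "ACGA"), ("c2", "g1", "ACGA")]
print(correct_umis(reads))
```

In this example the single “ACGA” read in cell c1 is merged into the more abundant “ACGT”, whereas the same UMI in cell c2 is kept because it belongs to a different cell barcode.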

Generation of the single-cell gene count matrix

After the FLNC reads were mapped to the genome, gffcompare (version 0.11.6)[33] was used to assign them to Ensembl Macaca fascicularis annotation gene models (Ensembl Macaca_fascicularis_5.0.99). Reads were defined as exonic when their gffcompare class code was one of “= c k m n j e o”. This procedure is consistent with the CellRanger pipeline. Next, the scGene_matrix utility was used to generate the single-cell gene expression data for each sample, based on the gffcompare output and the corrected cellBC and UMI of each FLNC read.
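The counting logic can be sketched as follows. This is an illustration, not the scGene_matrix code: it assumes that each FLNC read is available as a hypothetical `(cell_barcode, umi, gene_id, class_code)` tuple and that expression is counted as the number of distinct corrected UMIs per gene and cell, as is usual for UMI-based data.

```python
from collections import defaultdict

# gffcompare class codes treated as exonic in the text.
EXONIC_CLASS_CODES = set("=ckmnjeo")


def gene_count_matrix(flnc_records):
    """Count distinct UMIs per (gene, cell) for exonic FLNC reads.

    flnc_records -- iterable of (cell_barcode, umi, gene_id, class_code)
    Returns a dict mapping (gene_id, cell_barcode) -> UMI count.
    """
    umis = defaultdict(set)
    for cell, umi, gene, class_code in flnc_records:
        if class_code in EXONIC_CLASS_CODES:
            umis[(gene, cell)].add(umi)
    return {key: len(values) for key, values in umis.items()}


records = [
    ("c1", "U1", "G1", "="),
    ("c1", "U1", "G1", "="),  # same molecule sequenced twice
    ("c1", "U2", "G1", "j"),
    ("c2", "U3", "G1", "u"),  # intergenic read, ignored
    ("c2", "U4", "G2", "c"),
]
print(gene_count_matrix(records))
```

Counting distinct UMIs instead of reads means that a molecule sequenced more than once, as in the first two records, contributes only one count.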

Collapsing redundant isoforms

The cDNA_Cupcake (https://github.com/Magdoll/cDNA_Cupcake) Python script “collapse_isoforms_by_sam.py” was used with its default settings for minimum alignment coverage (“--min-coverage”, 0.99) and minimum alignment identity (“--min-identity”, 0.95). This step ensures the generation of transcripts with high accuracy.

Nonredundant isoform classification, coding frame prediction, and UTR detection

SQANTI3 (https://github.com/ConesaLab/SQANTI3)[34] was used for the characterization, quality control, and rule-based filtering of nonredundant isoforms based on Ensembl Macaca fascicularis annotation gene models (Ensembl Macaca_fascicularis_5.0.99). Isoforms were classified as known or novel. SQANTI3 was also used to call GeneMarkS-T (version 5.1 March 2014) for CDS coding frame prediction and UTR definition of the nonredundant isoforms.

Generation of the single-cell isoform count matrix

After the collapsing procedure, the scIsoform_matrix utility was used to generate single-cell isoform expression quantities for each sample with the following parameters: “-minUMIcount 3”. We further filtered out isoforms detected in fewer than 5 cells in all samples.

Expression matrix quality control

The Seurat R package (version 3.1.5)[35] was used to perform quality filtering of the single-cell gene and isoform expression matrices of each sample. The following filters were applied: “min.cells = 5, nFeature_RNA > 200, nFeature_RNA < 6000, percent.mt < 25” for the NGS gene expression matrices of the s1 and s2 samples; “min.cells = 5, nFeature_RNA > 100, nFeature_RNA < 3000, percent.mt < 25” for the TGS gene expression matrix of the s1 sample; and “min.cells = 5, nFeature_RNA > 100, nFeature_RNA < 3500, percent.mt < 25” for the TGS gene expression matrix of the s2 sample.
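For readers who do not use R, the effect of these thresholds can be sketched in plain Python. The sketch is an approximation of the filtering logic and not Seurat itself: it assumes a nested dictionary `counts[gene][cell]` as input, keeps features detected in at least `min_cells` cells, and then keeps cells whose number of detected features and percentage of mitochondrial counts fall within the limits. The function name `qc_filter` and the input layout are hypothetical.

```python
from collections import Counter


def qc_filter(counts, mito_genes, min_cells=5, min_features=200,
              max_features=6000, max_pct_mt=25.0):
    """Filter a gene-by-cell count table with Seurat-like thresholds.

    counts     -- dict: gene -> dict: cell -> count
    mito_genes -- set of mitochondrial gene IDs
    Returns the filtered table in the same nested-dict layout.
    """
    # 1. Keep features detected in at least min_cells cells.
    kept_genes = {
        gene: cells for gene, cells in counts.items()
        if sum(1 for n in cells.values() if n > 0) >= min_cells
    }

    # 2. Per-cell metrics on the remaining features.
    n_features, total, mito = Counter(), Counter(), Counter()
    for gene, cells in kept_genes.items():
        for cell, n in cells.items():
            if n > 0:
                n_features[cell] += 1
                total[cell] += n
                if gene in mito_genes:
                    mito[cell] += n

    # 3. Keep cells inside the feature-count and mitochondrial limits.
    kept_cells = {
        cell for cell in total
        if min_features < n_features[cell] < max_features
        and 100.0 * mito[cell] / total[cell] < max_pct_mt
    }
    return {
        gene: {cell: n for cell, n in cells.items() if cell in kept_cells}
        for gene, cells in kept_genes.items()
    }


toy = {
    "G1": {"c1": 5, "c2": 3},
    "G2": {"c1": 1, "c2": 2},
    "MT1": {"c1": 1, "c2": 10},
    "G3": {"c1": 4},
}
# Toy thresholds so that the small example exercises every rule.
print(qc_filter(toy, {"MT1"}, min_cells=2, min_features=1,
                max_features=4, max_pct_mt=50.0))
```

With the toy thresholds, gene G3 is removed because it is detected in only one cell, and cell c2 is removed because about two thirds of its counts are mitochondrial.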

Cell clustering and cell-type annotation

After the quality filtering procedure, the scMatrix2CellRangerH5 utility was used to convert the matrix to the CellRanger h5 format. Then, the CellRanger reanalyze pipeline was used for PCA and cell clustering, with the default parameters. The resulting “cloupe” files were loaded into Loupe Browser for manual annotation of cell types and fine-tuning. After cell-type annotation, the tables associating cell types with cell barcodes were imported into the Seurat R package (version 3.1.5) for downstream cell clustering and for generating heatmaps of cell-type marker gene and marker isoform expression.

Differential expression analysis of genes and isoforms

The Seurat R package (version 3.1.5)[36] was used for cell-type gene and isoform marker identification and differential expression analyses.

Generation of isoform structure views

Selected isoforms of interest were imported as GTF files into IGV (version 2.8.2)[37] for splicing structure viewing.

Data Availability

All data sets used in this study were generated using in-house sequencing and were deposited in the Genome Sequence Archive in the BIG Data Center, Beijing Institute of Genomics (BIG, http://gsa.big.ac.cn), Chinese Academy of Sciences, with Project Accession No. “PRJCA003458” (https://bigd.big.ac.cn/bioproject/browse/PRJCA003458) and GSA Accession Nos. “CRA003228” (https://bigd.big.ac.cn/gsa/browse/CRA003228) and “CRA003234” (https://bigd.big.ac.cn/gsa/browse/CRA003234).

Code Availability

The HIT-scISOseq analysis pipeline and source code are available from https://github.com/shizhuoxing/scISA-Tools.

Protocol Availability

The HIT-scISOseq protocol can be found at https://www.protocols.io/private/7472E845C45C11EC97780A58A9FEAC02.

Author contributions

Y.Z.L., C.L.X., C.T., and X.C.B. conceived and designed the project. C.T., Z.C.C., Z.X.S., and C.L.X. developed the experimental technology. Y.F.Z. collected the monkey cornea samples. C.T. and Z.C.C. performed single-cell sequencing experiments. C.L.X. and X.C.B. guided bioinformatics analyses. Z.X.S., J.Y.Z., and Y.F.Z. developed the data analysis pipeline and some documentation. Z.X.S. and J.Y.Z. performed the informatics analysis. Y.C., S.Q.X., and F.L. coordinated data release and assisted with executing the pipeline. Y.F.Z., C.L.X., Z.X.S., J.Y.Z., C.T., and F.L. wrote the manuscript and created the figures. L.Y.Z. and K.H.H. advised the study and revised the manuscript. All authors have read and approved the final version of this manuscript.

Conflict of interest disclosure

Z.C.C. and C.T. are employees of BGI Genomics. The authors declare no competing interests.

Funding/Supports

This study was supported by the CAMS Innovation Fund for Medical Sciences (2019-I2M-5-005); the National Natural Science Foundation of China (81530028, 81721003, 31871326, 91953122, 32100522, 42107148); the Local Innovative and Research Teams Project of Guangdong Pearl River Talents Programme (2017BT01S138); the Clinical Innovation Research Program of Guangzhou Regenerative Medicine and Health Guangdong Laboratory (2018GZR0201001); and the State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University. This work was supported in part by the U.S. National Institute of Food and Agriculture (NIFA; grant number 2017-70016-26051) and the U.S. National Science Foundation (NSF; grant numbers ABI-1759856 and MTM2-2025541) to F.L.

References

F.C. Tang, C. Barbacioru, Y.Z. Wang, E. Nordman, C. Lee, N.L. Xu, X.H. Wang, J. Bodeau, B.B. Tuch, A. Siddiqui, K.Q. Lao, and M.A. Surani, mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods, 2009. 6(5): p. 377-U86.
A.E. Saliba, A.J. Westermann, S.A. Gorski, and J. Vogel, Single-cell RNA-seq: advances and future challenges. Nucleic Acids Research, 2014. 42(14): p. 8845–60.
M.V. Fuccillo, C. Foldy, O. Gokce, P.E. Rothwell, G.L. Sun, R.C. Malenka, and T.C. Sudhof, Single-Cell mRNA Profiling Reveals Cell-Type-Specific Expression of Neurexin Isoforms. Neuron, 2015. 87(2): p. 326–40.
S. Petropoulos, D. Edsgard, B. Reinius, Q. Deng, S.P. Panula, S. Codeluppi, A. Plaza Reyes, S. Linnarsson, R. Sandberg, and F. Lanner, Single-Cell RNA-Seq Reveals Lineage and X Chromosome Dynamics in Human Preimplantation Embryos. Cell, 2016. 165(4): p. 1012–26.
J.J.W. Seow, R.M.M. Wong, R. Pai, and A. Sharma, Single-Cell RNA Sequencing for Precision Oncology: Current State-of-Art. Journal of the Indian Institute of Science, 2020: p. 1.
H.M. Kang, M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, S. Wong, L. Byrnes, and C.M. Lanata, Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nature Biotechnology, 2018. 36(1): p. 89.
A. Arzalluz-Luque and A. Conesa, Single-cell RNAseq for the study of isoforms-how is that possible? Genome Biology, 2018. 19.
R. Volden and C. Vollmers, Single-cell isoform analysis in human immune cells. Genome Biol, 2022. 23(1): p. 47.
S.A. Hardwick, W. Hu, A. Joglekar, L. Fan, P.G. Collier, C. Foord, J. Balacco, S. Lanjewar, M.M. Sampson, F. Koopmans, A.D. Prjibelski, A. Mikheenko, N. Belchikov, J. Jarroux, A.B. Lucas, M. Palkovits, W. Luo, T.A. Milner, L.C. Ndhlovu, A.B. Smit, J.Q. Trojanowski, V.M.Y. Lee, O. Fedrigo, S.A. Sloan, D. Tombacz, M.E. Ross, E. Jarvis, Z. Boldogkoi, L. Gan, and H.U. Tilgner, Single-nuclei isoform RNA sequencing unlocks barcoded exon connectivity in frozen brain tissue. Nat Biotechnol, 2022. 40(7): p. 1082–1092.
E. Rebboah, F. Reese, K. Williams, G. Balderrama-Gutierrez, C. McGill, D. Trout, I. Rodriguez, H.D. Liang, B.J. Wold, and A. Mortazavi, Mapping and modeling the genomic basis of differential RNA isoform expression at single-cell resolution with LR-Split-seq. Genome Biology, 2021. 22(1).
M. Philpott, J. Watson, A. Thakurta, T. Brown, Jr., T. Brown, Sr., U. Oppermann, and A.P. Cribbs, Nanopore sequencing of single-cell transcriptomes with scCOLOR-seq. Nat Biotechnol, 2021. 39(12): p. 1517–1520.
K. Lebrigand, V. Magnone, P. Barbry, and R. Waldmann, High throughput error corrected Nanopore single cell transcriptome sequencing. Nat Commun, 2020. 11(1): p. 4025.
R. Volden and C. Vollmers, Highly Multiplexed Single-Cell Full-Length cDNA Sequencing of human immune cells with 10X Genomics and R2C2. BioRxiv, 2020.
A. Byrne, A.E. Beaudin, H.E. Olsen, M. Jain, C. Cole, T. Palmer, R.M. DuBois, E.C. Forsberg, M. Akeson, and C. Vollmers, Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nature Communications, 2017. 8.
M. Hagemann-Jensen, C. Ziegenhain, P. Chen, D. Ramskold, G.J. Hendriks, A.J.M. Larsson, O.R. Faridani, and R. Sandberg, Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat Biotechnol, 2020. 38(6): p. 708–714.
I. Gupta, P.G. Collier, B. Haase, A. Mahfouz, A. Joglekar, T. Floyd, F. Koopmans, B. Barres, A.B. Smit, S.A. Sloan, W.J. Luo, O. Fedrigo, M.E. Ross, and H.U. Tilgner, Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells. Nature Biotechnology, 2018. 36(12): p. 1197-+.
I. Legnini, J. Alles, N. Karaiskos, S. Ayoub, and N. Rajewsky, FLAM-seq: full-length mRNA sequencing reveals principles of poly (A) tail length control. Nature Methods, 2019. 16(9): p. 879–886.
Y. Liu, H. Nie, H. Liu, and F. Lu, Poly (A) inclusive RNA isoform sequencing (PAIso – seq) reveals wide-spread non-adenosine residues within RNA poly (A) tails. Nature Communications, 2019. 10(1): p. 1–13.
Pacific Biosciences Single-Cell Iso-Seq Library Preparation Using SMRTbell Express Template Prep Kit 2.0 Customer Training. https://www.pacb.com/wp-content/uploads/Single-Cell-Iso-Seq-Library-Preparation-Using-SMRTbell-Express-Template-Prep-Kit-2.0-Customer-Training.pdf, 2020: p. 32.
A.M. Wenger, P. Peluso, W.J. Rowell, P.C. Chang, R.J. Hall, G.T. Concepcion, J. Ebler, A. Fungtammasan, A. Kolesnikov, N.D. Olson, A. Topfer, M. Alonge, M. Mahmoud, Y. Qian, C.S. Chin, A.M. Phillippy, M.C. Schatz, G. Myers, M.A. DePristo, J. Ruan, T. Marschall, F.J. Sedlazeck, J.M. Zook, H. Li, S. Koren, A. Carroll, D.R. Rank, and M.W. Hunkapiller, Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnology, 2019. 37(10): p. 1155–1162.
U. Schlecht, J. Mok, C. Dallett, and J. Berka, ConcatSeq: A method for increasing throughput of single molecule sequencing by concatenating short DNA fragments. Sci Rep, 2017. 7(1): p. 5252.
N. Kanwar, C. Blanco, I.A. Chen, and B. Seelig, PacBio sequencing output increased through uniform and directional fivefold concatenation. Sci Rep, 2021. 11(1): p. 18065.
Z.-C.C. Ying-Feng Zheng, Zhuo-Xing Shi, Kun-Hua Hu, Jia-Yong Zhong, Chun-Xiao Wang, Wen Shi, Ying Chen, Shang-Qian Xie, Feng Luo, Xiao-Chen Bo, Chong Tang, Yi-Zhi Liu, Chuan-Le Xiao, HIT-scISOseq: High-throughput and High-accuracy Single-cell Full-length Isoform Sequencing for Corneal Epithelium. bioRxiv, 2020.
J.T.S. Aziz M. Al’Khafaji, Kiran V Garimella, Mehrtash Babadi, Moshe Sade-Feldman, Michael Gatzen, Siranush Sarkizova, Marc A. Schwartz, Victoria Popic, Emily M. Blaum, Allyson Day, Maura Costello, Tera Bowers, Stacey Gabriel, Eric Banks, Anthony A. Philippakis, Genevieve M. Boland, Paul C. Blainey, Nir Hacohen, High-throughput RNA isoform sequencing using programmable cDNA concatenation. bioRxiv, 2021.
T. Yin, W. Yao, A.D. Lemenze, and L. D'Adamio, Danish and British dementia ITM2b/BRI2 mutations reduce BRI2 protein stability and impair glutamatergic synaptic transmission. J Biol Chem, 2021. 296: p. 100054.
R. Liu, G. Yang, M. Bao, Z. Zhou, X. Mao, W. Liu, X. Jiang, D. Zhu, X. Ren, J. Huang, and C. Chen, STAMBPL1 promotes breast cancer cell resistance to cisplatin partially by stabilizing MKP-1 expression. Oncogene, 2022. 41(16): p. 2265–2274.
F. Shi, L. Sun, and S. Kaptoge, Association of beta-2-microglobulin and cardiovascular events and mortality: A systematic review and meta-analysis. Atherosclerosis, 2021. 320: p. 70–78.
C.E. Friedman, Q. Nguyen, S.W. Lukowski, A. Helfer, H.S. Chiu, J. Miklas, S. Levy, S. Suo, J.J. Han, P. Osteil, G. Peng, N. Jing, G.J. Baillie, A. Senabouth, A.N. Christ, T.J. Bruxner, C.E. Murry, E.S. Wong, J. Ding, Y. Wang, J. Hudson, H. Ruohola-Baker, Z. Bar-Joseph, P.P.L. Tam, J.E. Powell, and N.J. Palpant, Single-Cell Transcriptomic Analysis of Cardiac Differentiation from Human PSCs Reveals HOPX-Dependent Cardiomyocyte Maturation. Cell Stem Cell, 2018. 23(4): p. 586–598 e8.
B. Deonovic, Y. Wang, J. Weirather, X.-J. Wang, and K.F. Au, IDP-ASE: haplotyping and quantifying allele-specific expression at the gene and gene isoform level by hybrid sequencing. Nucleic Acids Research, 2017. 45(5): p. e32-e32.
S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, Basic local alignment search tool. Journal of Molecular Biology, 1990. 215(3): p. 403–410.
C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T.L. Madden, BLAST+: architecture and applications. BMC Bioinformatics, 2009. 10(1): p. 421.
H. Li, Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 2018. 34(18): p. 3094–3100.
G. Pertea and M. Pertea, GFF Utilities: GffRead and GffCompare. F1000Research, 2020. 9.
M. Tardaguila, L. De La Fuente, C. Marti, C. Pereira, F.J. Pardo-Palacios, H. Del Risco, M. Ferrell, M. Mellado, M. Macchietto, and K. Verheggen, SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Research, 2018. 28(3): p. 396–411.
A. Butler, P. Hoffman, P. Smibert, E. Papalexi, and R. Satija, Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 2018. 36(5): p. 411–420.
L. de la Fuente, Á. Arzalluz-Luque, M. Tardáguila, H. del Risco, C. Martí, S. Tarazona, P. Salguero, R. Scott, A. Lerma, and A. Alastrue-Agudo, tappAS: a comprehensive computational framework for the analysis of the functional impact of differential splicing. Genome Biology, 2020. 21(1): p. 1–32.
H. Thorvaldsdóttir, J.T. Robinson, and J.P. Mesirov, Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics, 2013. 14(2): p. 178–192.

There is NO Competing Interest.

SupplementaryFigures.docx
SupplementaryTables111.xlsx
Supplementary Tables
NCOMMS2230209Acssc.pdf
NCOMMS2230209Ars.pdf
