1. Animal husbandry
Amphioxus Branchiostoma floridae were obtained from Dr. Jr-Kai Yu’s laboratory at the Institute of Cellular and Organismic Biology, Academia Sinica, Taiwan. They were maintained and induced to spawn following the protocol as we described and used in B. belcheri41,42. Eggs fertilization and embryos culture were carried out according to our previous report43. Embryos are staged according to a recent study9.
2. Single cell RNA sequencing library preparation
We prepared snRNA-seq libraries on the Split-seq platform10. Embryos at indicated stages were harvested and stored in RNAlater solution (AM7020, Ambion). Nuclei extraction was performed as described10. Briefly, indicated embryos were transferred into a 1 mL Dounce homogenizer containing 1mL homogenizing buffer (250 mM sucrose, 25 mM KCl, 5 mM MgCl2, and 10 mM Tris [pH=8.0]; 1 μM DTT, RNase Inibitor and 0.1% Triton-X100). 5-10 strokes of loose pestle followed by 10-20 strokes of tight pestle were performed. The homogenates were filtered with a 40 μm strainer into 5 mL Eppendorf tubes and then spun down for 4 minutes at 600 g at 4oC. The pellet was re-suspended and washed in 1 mL of PBS containing RNase inhibitors and 0.1% BSA. At last the nuclei were passed through a 40 μm strainer again before being counted. The nuclei were split into 48 wells, each containing barcoded well-specific reverse transcription primer, for in-cell reverse transcription. The second and third barcoding consist of a ligation reaction. After the third round of barcoding, the nuclei were divided into 16 aliquots and then lysed before cDNA purification. Purified cDNA was subjected to template switch and a round of real-time PCR amplification. PCR reactions were stopped at the beginning of plateau stage. At last, 600 pg purified PCR products each were used to generate Illumina compatible sequencing libraries. Each library was labeled by a distinct, indexed PCR primer pair, which is served as the fourth barcode.
3. Single-nucleus RNA sequencing and data processing
Libraries were sequenced on NextSeq systems (Illumina) using 150 nucleotide kits and paired-end sequencing. Read 1 covered the transcript sequences and read 2 covered the UMI and UBC barcode combinations. Firstly, we added the fourth barcode (sequencing index, 6 nt) at read 2 ends, then discarded reads which had more than one mismatched base with the third barcode. Thirdly, any reads in UMI region had more than one low quality base (phred <=10) were also discarded. The sequencing results were aligned to exons and introns in the reference genome (https://www.ncbi.nlm.nih.gov/assembly/GCA_015852565.1/) and aggregated intron and exon counts at the gene level were calculated by kallisto and bustools software as described (https://bustools.github.io/BUS_notebooks_R).
4. Single-cell ATAC-seq library preparation
The embryos harvested at indicated stages were lysed in cold lysis buffer (10 mM Tris-HCl [pH 7.4], 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, 0.1% NP40, 0.01% Digitonin and 1% BSA). The nuclei were extracted by gentle pipetting for 10 times. Larvae were homogenized in 2 mL Dounce homogenizer (SIGMA) containing 2 mL lysis buffer. Dounce homogenization and filtration were performed as described above10. Nuclei were pelleted by spinning at 500 g for 5 min at 4oC. Then nuclei were washed twice by suspending pellet in chilled PBS (Gibco) with 0.04% BSA (BBI). The nuclei were re-suspended in diluted nuclei buffer (10x Genomics). The 10x ATAC libraries were constructed according to the Single Cell ATAC v1 workflow (https://www.10xgenomics.com/products/single-cell-atac), and proceeded with the MGI Easy Universal Library Conversion Kit (App-A, MGI) to convert the libraries’ structure.
5. Single-cell ATAC sequencing and data processing
Libraries were sequenced on MGI2000 (MGI). Raw data were split into reads and cell barcode using custom scripts. Reads were mapped to reference genome (https://www.ncbi.nlm.nih.gov/assembly/GCA_015852565.1/) using Cell Ranger ATAC v1.2.0 with default parameters. Single-cell accessibility counts were generated after running ‘cellranger-atac count’.
6. Analysis of snRNA-seq data
6.1 Data quality control and normalization for snRNA-seq data
After the matrix was exported, quality control was performed to remove low quality cells and potential doublets. Considering the range of cell library sizes (i.e. sequencing depth) varied among stages, we costumed different filtering thresholds (based on the number of detected genes) for each embryonic stage, which were: B (400-3,000 genes), G3 (350-3,000 genes), G4 (250-3,000 genes), G5 (150-3,000 genes), G6 (350-3,000 genes), N0, N1 and N3 stage (150-3,000 genes), L0 stage (100-3,000 genes). The threshold for filtering the adult tissue cells was 150-3,000 genes. As a result, a total of 148,875 embryonic cells and 235,170 adult tissue cells were remained for subsequent analysis. Furthermore, genes expressing in less than 3 cells were filtered out. This step was implemented using the build-in functions ‘scanpy.pp.filter_cells’ and ‘scanpy.pp.filter_genes’ from ScanPy44.
The resulting gene-by-cell matrices were then log-transformed (with a pseudo-count added) followed by the library-size normalization, with the median of library-size as the size-factor. This was implemented using the two build-in functions ‘scanpy.pp.normalize_total’ and ‘scanpy.pp.log1p’ from ScanPy, for total-count normalization and log-transformation respectively.
6.2 Cell clustering and visualization within each stage
Selection of highly variable genes
We observed that cell groups were dominated by two types of RNA capturing primers (random and polyT) (Supplementary Figure 1). Unless special circumstances, we will refer to these two groups caused by technical noise as “polyT group” and “random group”, respectively. To avoid their effects, highly variable genes (HVGs) were first identified separately within each group, and then merged together. Cells in B stage were exceptional, which were clustered into three populations that were driven by different states of cell division, so the HVGs were selected separately in each division states.
We used the approach in Seurat v2 to identify HVGs within each group. Specifically, it calculated the average expression and dispersion (variance or mean) for each gene and placed these genes into several bins (30 bins in our case) based on (logarithmized) average expression. The normalized dispersions were then obtained by scaling with the mean and standard deviation of the dispersions within each bin. Genes with a (log-normalized) mean expressions above a certain value (costumed for each stage) and a normalized dispersion higher than 0.25 were identified as highly variable ones. This was implemented using the build-in function ‘scanpy.pp.highly_variable_genes’ from ScanPy. At last, we took those genes that were highly variable in both groups and those with top dispersion as HVGs for downstream analysis (Supplementary Table 1 and 2).
Preprocessing for dimensionality reduction
We restricted the expression matrix to those genes found highly variable. Before performing dimensionality reduction, the gene-by-cell expression matrix was centralized and scaled. This was done for each group of the same RNA capturing primer as described above. For B stage, the expression matrix was centered and scaled within each cell cycle state. Besides, cell groups in this stage were much more confounded by cell library size than other stages so we treated the logarithmized library size (i.e., counts-per-cell) as the latent factor and regressed it out.
Visualization of stage-wise snRNA-seq data using UMAP
We first performed principle component analysis (PCA) and selected the number of top principal components (PCs) with highest explained variances based on the “elbow” of the scree plot of the principle components. In practice, we found the final results are quite robust to the number of selected PCs. UMAP was performed to further embed each cell from the reduced PC space onto a 2D map. UMAP is a graph-base manifold learning method that can provide a good visualization with the intrinsic structure of the original data preserved. Meanwhile it is also a computationally efficient tool for large-scale datasets. It first computes the approximate k nearest neighbors (KNNs) for each data point, building a weighted mutual-KNN graph with each node representing each data point (cell in our case), and embeds each node of the graph into the lower dimensional space. Note that the Euclidean distance metric is not scale-invariant so that it is quite sensitive to batch effects. Instead, we searched the approximate KNNs for each single cell based on cosine distance in the PC space, with the number of neighbors setting as 20. This was implemented using the build-in function ‘scanpy.tl.umap’ from ScanPy, which is a convenient wrapper of the original function ‘UMAP’ from the Python package ‘umap-learn’.
Clustering of cell populations and identification of differentially expressed genes
To cluster single cells into distinct populations, we used a graph-based clustering approach, which applies Leiden45 community detection on the weighted KNN graph build by UMAP. The Leidenalgorithm is very similar to the Louvain46 community detection algorithm that is wildly used for single cell clustering. This clustering method was achieved by the build-in interface ‘scanpy.tl.leiden’ from ScanPy. Cluster-pecific genes were found using “model-based analysis of single-celltranscriptomics” (MAST)12 comparing each cluster versus the others, achieved by the function ‘FindAllMarkers’ in Seurat. After comparing the differentially expressed genes for each cluster, we manually merged those groups with no significant difference from each other.
6.3 Visualization of the embryonic cells from all the developmental stages
Construction of the single-cell graph for the merged snRNA-seq data
In order to construct a 2D atlas of single-cells from all the developmental stages that reveals both the stage order and the intrinsic lineage structures, we first built a single-cell graph, with nodes as cells and edges connecting cells with similar expression profiles. A simple and direct approach is to apply a global k-nearest-neighbor search for each cell among all the stages. However, due to the batch effects caused by technical and biological noises, the difference of the same cell type (or lineage) across stages might be greater than that of different cell types (or lineage) in the same stage, which can lead to the k-nearest neighbors of each cell more likely to be in the same stage rather than its progenitor cells or daughter cells from the adjacent stages. Therefore, a global k-nearest neighbor search without any constraints would result in a biased single-cell network. To remove these batch effects and keep the order of stages, we borrowed the idea from a study by Wagner et.al.7 and designed a novel approach, stage-wise-KNN, for our scRNA-seq dataset.
For genes used to calculate the single-cell network, we merged HVGs from each stage, and kept those appeared in more than three stages. We also included some canonical marker genes collected from published literatures. Before inputting to the next step, the expression matrices from each stage were first z-scored (mean-shifted and scaled to unit variance) for each of the used genes. Let’s denote the resulting matrix X(t) at time point t, with each element representing the z-score of gene i in cell j.
The single-cell network was constructed by connecting only the adjacent stage-pairs, with no edges connecting cells in the same stage, except for the first stage (B stage). Specifically, for each pair of adjacent stages between time point t and t+1, we first applied PCA on the concatenated matrix to get the reduced dimensions, and searched the KNNs of each cell in stage t+1 from its parent stage t, based on the PC space. A binary edge would be connecting two cells if one is in the parent stage (the earlier timepoint) and is detected as a member of the KNNs of the other. As the start timepoint, the B stage was the only exception that cells in which would also connect to their KNNs in the same stage.
This single-cell network preserved the order of stages and revealed the structural relationships between different lineages. We reasoned that if two cells in the same stage had similar expression profiles, they would have a majority of KNNs in common from the previous stage, which would “attract” them to be embedded onto a nearby place in the consequent 2D map.
Parameter setting
Considering the different number of cells and the varied biological complexity in different stages, we adapted different number of PCs and nearest-neighbors for each pair of stages. We used the top 30 PCs for pairs from B to G6 stage, and the top 50 PCs for that from B6 to L0 stage. The size of neighborhood for cells from B to N1 stage was defined as k=10, while they were set as k=5 and k=3 for N3 and L0 stage, respectively.
Embedding the single-cell network using a force-directed graph layout algorithm
To apply a force-directed graph layout algorithm to embed the single-cell network, we utilized the build-in function ‘scanpy.tl.umap’ of ScanPy after setting the pre-built network in the slot “connectivities”. The parameter “min_dist” of this function was set as 0.1.
6.4 Construction of the developmental tree
The coarse-grained developmental tree was constructed by taking previously defined clusters in each stage as nodes and connecting nodes across timepoints by “ancestor voting”. A vote from a cell for its ancestor cell was defined as the nearest neighbor in the previous stage, which was determined when building the stage-wise-KNN based single-cell network. For each cluster in stage t+1, its ancestor node in stage t would be the cluster winning the largest number of votes from cells in this cluster.
A step of “group refinement” would come after the “ancestor voting” between each two adjacent stages. In case that some clusters in stage t had no descendent nodes in stage t+1, probably because of different cluster resolutions, they would take all the single-cells that had voted for them from the other clusters and group a new descendent node.
We also made some manually adjustment to the developmental tree using expert knowledge, for example, merging those branches with no significant gene expression difference. Finally, the constructed developmental tree was visualized using the R package ‘ggtree’ 47.
6.5 Annotation of embryonic cell populations and lineages
Considering the unbalanced sequencing depths of different stages, library-size normalization was applied to the raw expression counts of each cell with 1000 as the uniform size-factor, followed by log-transformation with a pseudo-count added. Lineage-specific genes were then found using MAST by comparing the expression profiles of cells in each lineage with those of the others, implemented by the function ‘FindAllMarkers’ in Seurat.
GO enrichment analysis of DEGs was performed by an R package “clusterProfiler” (https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) with the customized reference database “org.bf.eg.db” (Branchiostoma floridae). The statistical significance was adjusted and the “pvalueCutoff” parameter was set as 0.05 for both GO and KEGG analysis. Top 20 significantly enriched GO terms were selected to show by bubble plot.
6.6 Notochord lineage analysis
The notochord lineage cells were extracted out for further lineage analysis. HVGs were selected and dimensionality reduction was performed using UMAP. MAST was used to find out the differential expressed genes for each cluster that formed the notochord lineage. Diffusion pseudo-time (DPT)48 was applied to infer the pseudo-time of each cell.
The single-cell graph with re-clustering labels were further abstracted into a more concise and interpretable graph by partition-based graph abstraction (PAGA)44. In brief, each cluster was regarded as a node, and the weight of a connection between each pair of nodes was estimated based on a hypothesis test against a null model. This gave us a coarse-resolution of the partitioning and allowed a more global perspective of the intrinsic structure of the data.
7. Analysis of single-cell ATAC-seq data
7.1 Quality control and preprocessing of scATAC-seq data
Peak-cell matrics and fragment files were analyzed with Signac (v 0.2.5). For the consequent peak-cell matrix, we filtered cells that were not adequately sequenced compared to the main populations in the same stage. As done for snRNA-seq data, we customized different filtering thresholds (based on the number of detected peaks) for each embryonic stage, which were: B (600-3,500 peaks), G3 (1,500-10,000 peaks), G6 (1,000-15,000 peaks), N1 and N3 (1,500-10,000 peaks), L0 (1,500-15,000 peaks). The resulting number of cells for each stage was listed in the Supplemetary Table 2.
7.2 Independent visualization of scATAC-seq data for each stage
For independent visualization of the epigenomic data, TF-IDF transformation was applied to the peak-cell matrix of each stage and partial singular value decomposition (SVD) and UMAP were performed for dimensionality reduction. These steps were implemented using the built-in functions of Signac (RunTFIDF with scale.factor=10000, RunSVD with n=50 and reduction.key = 'LSI_') and Seurat (RunUMAP with n.neighbors = 30).
7.3 Construction of gene activity matrix from scATAC-seq data using Cicero
Before integrating the scATAC-seq data with the snRNA-seq data, we first used Cicero35 to transform the peak-by-cell matrices of different developmental stages into gene-by-cell matrices, with the values representing gene activity scores. These gene activity matrices were then normalized to the uniform column-sums, the median of the column-sums in that stage, followed by log transformation with a pseudo-count added.
8. Integration of snRNA-seq and scATAC-seq data and label transfer
For each developmental stage with both transcriptomic and epigenetic data, we began with the normalized gene activity scores and the normalized gene expression values calculated from the previous analysis, z-scored (centered and scaled to unit variances) separately based on the HVGs selected from snRNA-seq data, which was used for clustering analysis before. The resulting z-scored matrix-pair in that stage was concatenated and used to calculating the reduced dimensions using PCA. We then corrected the coordinates in PC space by performing Harmony, a method for removing batch effects. After that, a KNN-classifier was built using the corrected coordinates and the cluster labels of snRNA-seq data as the “training samples” and that from scATAC-seq data as the “testing samples”, with the predicted cluster labels as the final transferred labels for those scATAC-seq cells.
Note that we also applied the Seurat build-in method (CCA-anchor) for integration and label-transfer. The most of transferred labels was in consistent with that from Harmony-KNN, while with less confidences, so we adopted the results of Harmony-KNN for the downstream analysis.
9. Combined analysis to identify new lineage specific genes
To identify new lineage marker genes, we carried out combined analysis using our epigenetic and transcriptomic data from each stage. Briefly, we take the intersections of the DEGs (p < 0.01) and differentially accessible peaks (p < 0.001). The candidate new lineage marker genes were selected with high expression in two successive stages.
10. Whole mount in situ hybridization assays
Fourteen genes showing specific expression in different germ layers in both single-nuclear RNA-seq and single-cell ATAC-seq data were chosen for this analysis. Their mRNA sequences were amplified from a mixed cDNA library using primers listed in (Supplementary Table 5), cloned to pGEM-T-Easy vector (PR-A1360, Promega), and verified by DNA sequencing. Digoxigenin (DIG)-labelled antisense riboprobes (11093088910, Roche) of all these genes were synthesized with Sp6 (P1081, Promega) or T7 (P2075, Promega) RNA polymerase. Embryos at desired stages were fixed overnight with 4% (wt/vol) PFA-MOPS-EGTA (pH 7.5) at 4℃, and then stored at -20℃ in 70% ethanol (vol/vol) for use. Whole mount in situ hybridization (WISH) was performed as previously described (Yu and Holland 2009 Cold Spring Harb Protoc). Stained embryos were photographed using an inverted microscope (Olympus, IX71).
11. Construction of website and database
HTML and PHP were used for webpage construction, and MySQL was used for the storing and query functions. All the code was installed on a Linux Host with Apache webserver.
12. Data availability
All the snRNA-seq and scATAC-seq data have been deposited in the CNSA (https://db.cngb.org/cnsa/) of CNGBdb with accession number CNP0000891.