Development of the Super deepSAGE technology
A flowchart of the Super deepSAGE experiment is summarized in Fig. 1. Dynabeads® M-270 Amine (Thermo Fisher Scientific, China) were coupled with –C6-SH labeled reverse transcription-primer with the sequence containing the 5’-CAGCAG-3’ recognition site of EcoP15I and an Oligo(dT) sequence at 3’ end designed intentionally to complement the poly(A) sequence of mRNAs (Synthesized by Sangon Biotech, China). The coupling procedure was carried out following protocol reported by Hill and Mirkin [16] using the succinimidyl 4-(p-maleimidophenyl)butyrate (SMPB) crosslink reagent (Thermo scientific, Shanghai, China). Ten micrograms of mRNA were reverse-transcribed (cDNA synthesis system, Invitrogen) with the Oligo(dT) magnetic beads to generate single-stranded cDNA using protocol recommended by the manufacturer. The product was converted to double-stranded cDNA using random primer and then digested with NlaIII (NEB, Beijing, China). The biotin-labeled linkers (linker-5EA) with phosphorylated 5’ termini and 3’ end overhang (5’-CATG-3’), containing the EcoP15I recognition site were prepared by annealing commercially synthesized oligonucleotides. The magnetic beads-bound cDNA was washed and linked to linker-5EA by T4 DNA ligase (NEB, Beijing, China). As a result, each cDNA fragment bounded to the magnetic beads is flanked by two inverted repeats of EcoP15I recognizing sites. The type III restriction enzyme EcoP15I cleaves the DNA downstream of the recognizing site (25 nt in one strand and 27 nt in the other strand) leaving a 5’ end overhang of two bases [17, 18]. Linker-ligated cDNAs on the magnetic beads were digested with ten units of EcoP15I under conditions described previously [19]. The supernatant containing released biotin-labeled fragments were added to streptavidin magnetic beads (Promega, Beijing, China), and the biotin-labeled fragments of the cDNAs were captured. Finally, barcoded linkers (linker-3EA) with two random base overhangs at 5’ end and phosphorylated termini were prepared and ligated to the cDNA ends by T4 DNA ligase (NEB, Beijing, China). The resulting products were amplified by polymerase chain reaction (PCR), and the 119 bp product was separated by polyacrylamide gel electrophoresis (PAGE) and recovered from the gel. The barcoded libraries prepared from different samples were combined into a single multiplex sequencing reaction at the end of library construction and submitted for deep sequencing. The sequence information of synthetic oligos, linkers, and primers are available in Supplemental document 1.
The serial analysis of gene expression (SAGE) was first developed by Velculescu et al. [20] and improved by Saha et al. [21], Matsumura et al. [19], and Nielsen et al. [22]. The traditional SAGE library construction protocol includes multiple steps, and the separation of the linker-tag fragment is challenging to perform, and the PAGE purification often produces low yield. The library construction protocol in this study was improved by introducing two magnet beads: 1) Dynabeads® M-270 Amine coupled with –C6-SH labeled Oligo(dT) reverse transcription primer; 2) The streptavidin magnetic beads which can capture biotin-labeled linkers (linker-5EA). The magnetic beads used in this protocol can capture and purify the DNA fragments and is technically less demanding than PAGE separation. This modification increased the yield of linker-tag fragments and resulted in the robustness of the technique. Also, the primers and linkers were designed compatible with multiplexed deep sequencing technology, saving the sequencing cost.
Animals, samples collection, and deep sequencing
A total of 224 tissue samples across 28 different tissues were collected from a slaughtering farm located in Hubei province in China. The samples were collected from 32 animals from a Duroc × Landrace × Yorkshire (DLY) commercial crossbreed pig populations consisting of 16 males and 16 females with a median age of 21 days. The endometrium, placenta, and conceptus were collected from Landrace × Yorkshire (LY) sows of 65 days of gestation. The detailed sample information is available in Table 1. In the computational extraction of tags from sequence data, the in-house designed program removes the two bases at the 5’ end. This ‘digital removal’ is performed to minimize the less accurate effect of two random bases, at the 5’ end of linker-3EA, and could potentially reduce the length of tags, and affect the representative ability of the data. However, direct link with a linker that has two random bases at the 5’ end forming stick ends will 1) enhance the efficiency of the link assay, and 2) no additional blunt ending process was needed. The inaccuracy caused by this linkage process was removed by the ‘digital removal’ procedure, thereby lowering the systematic bias in the data.
Analysis of the complexity and diversity of Super deepSAGE data across tissues
Rarefaction analysis of size-fractionated library for each sample was performed to determine the complexity and diversity of the tissues in pig [23]. The sequencing depth achieved using eight samples-multiplexed deep sequencing technic reached near-saturation of transcript discovery within all size ranges. Saturation was reached very early in Super deepSAGE sequencing data due to the lower complexity of the tags (number of tags) in libraries (Fig. 2A-F showed the first six deep sequencing runs). Samples from the same sequencing run were compared using reads from different size-fractionated libraries to further investigate the diversity of the relationship between sequencing depth and transcript discovery. In all deep sequencing runs, tissues exhibited transcriptome diversity in terms of both total numbers of reads and the number of transcripts discovered. For example, the muscle tissue (MS.DI_2) saturated much sooner than the conceptus (CPT.SPH_8) and have less number of transcripts discovered in the first deep sequencing run (Fig. 2A). Similar sequencing depth and diversity were obtained using size-fractionated data from each of the sequencing run and transcript as outcome measures (Supplemental Fig. SA-D).
Data quality and internal consistency control using principal component analysis (PCA)
Principal component analysis (PCA) was used to check if the samples clustered together according to their tissue source [24]. Even though the samples were collected from 32 individual animals from different families, genders, and ages (Table 1), the PCA plot showed that the samples from the same tissues clustered together and were distinct from other samples (Fig. 3). The transcripts in conceptus, blood, and macrophages had relatively distinct expression profile and segregation apart from the rest of the samples when plotted using the first two components of the PCA analysis (Fig. 3A). The adenohypophysis, cerebral cortex, heart, and muscle were aggregate and separated from other samples when plotted using the third and fourth component (Fig. 3B). The adrenal, liver, mesenteric lymph nodes, peripheral blood mononuclear cell, and spleen were slightly away from other samples when plotted using the fifth and sixth component (Fig. 3C). When removing those samples from the datasets and re-calculating the PCAs, the remaining samples; fat, placenta, endometrium, kidney, lung, and stomach grouped differently according to the tissue/cell types (Fig. 3D to F). Tissues having similar cellular composition and biological function, like alveolar macrophages and monocyte-derived macrophages or heart and skeletal muscles, clustered closely together but were separated from each other.
Comparison of the Super deepSAGE data with previously published microarray research
The expression profiles were compared with microarray data published previously [8]. There is a total of 18,306 common genes for seven tissues, while high correlations (r=0.85-0.93 and p-values less than 1.0×e-30) were calculated between the gene expression profiles generated by the two platforms (Fig. 4). Similar dynamic range was observed in both platforms for transcripts with relative expression level between 0.55 and 0.95. Differences in expression profiles were apparent between the two platforms with several genes exhibiting relatively higher or lower expression values in either platform deviated from the diagonal line (Fig. 4). All transcripts had an expression value in the microarray, due to background hybridization or noise, regardless of whether it was truly expressed or not. The overall dynamics of the fitted curve showed that the Super deepSAGE is more sensitive than that microarray for the low expressed genes showing a concaved trend at the lower ends (with relative expression level less than 0.55 in Fig. 4). For those genes with high expression levels, variability is high in both Super deepSAGE and microarray platforms.
As compared by microarray, reliable gene expression profiles can be generated by Super deepSAGE in seven known tissues. Of the 50 highest expressed Super deepSAGE tags, 38 (76%) found corresponding probesets in the 50 highest expressed genes, and only three tags showed a statistically significant difference between Super deepSAGE and microarray data. Two possibilities could cause such discrepancies between Super deepSAGE and microarray data: 1) the SAGE tag was derived from two or more different transcripts, which were differentially expressed in the samples tested, and 2) the microarray probeset can target two or more transcripts due to sequence similarity of transcripts. For example, the transcripts from the same gene family will always produce the same SAGE tag (attributable to the lower resolution power of Super deepSAGE) and preferred to hybrid to the same microarray probeset (can be minimized by design probesets in the none-conserved region). Regardless of some discrepancy, we conclude that Super deepSAGE data are overall compatible with the microarray data and provide faithful gene expression profiles.
Identification of tissue-specific expression of transcripts
A total of 4,165 transcripts showed significant up or down-regulation at least in one tissues in comparison to the average tag count for all other 27 tissues. K-means clustering was then performed by trying a different number of centers (K from 5 to 28) and several random sets (S from 10 to 1000). Finally, we selected K = 10 and S = 400 to produce clustering result with clean and clear expression pattern (by visualization), highly reproducible for each duplicated run (Fig. 5). The detailed clustering information is available in Supplemental document 2. The result indicated that Cluster 1 has the largest number of transcripts, and most of these transcripts were expressed low in tissues, except macrophages, PBMCs, blood, and conceptus which were moderately expressed. The conceptus specifically expressed transcripts were in Cluster 2, while the conceptus, macrophages, PBMCs, and blood de-expressed transcripts were in Cluster 4. The macrophages, PBMCs, blood, mesenteric lymph nodes, and spleen specific expressed transcripts were in Cluster 5. The genes specifically expressed in heart and skeletal muscle were in cluster 10. The cerebral cortex specifically expressed genes were in Cluster 6, and liver specifically expressed transcripts were in Cluster 7. The adrenal cortex, adrenal medulla, cerebral cortex, and adenohypophysis specifically expressed transcripts were in Cluster 8. Transcript in Cluster 3 and Cluster 9 were ubiquitously expressed or expressed in multiple tissues.
Gene expression data obtained from transcriptional profiling experiments have inspired several applications, such as the identification of differentially expressed genes [25, 26] and the creation of gene classifiers for improved diagnoses of diseases such as cancer [27, 28]. The gene expression profile of 224 samples created in this study is complicated that traditional models were difficult to apply to this data to find differentially expressed genes. An ad hoc method comparing each tissues to the average tag count for all other 27 tissues was performed, and a very stringent threshold was set (fold change >5.0, p-value <1.0×10-6) to filter the tissues specifically expressed transcripts. The K-means clustering algorithms which group similar transcripts and separate dissimilar transcripts by assigning them to different clusters have proven to be useful for identifying biologically relevant gene clusters for different biological status [29]. Even though very useful, the K-means clustering algorithm is particularly sensitive to initial starting conditions and converges to the point that is the local minimum [30]. Furthermore, the number of clusters (parameter K) is difficult to be determined. In this study, global-seeding procedures of BF98 [31] have been introduced into the algorithm to improve the consistency and quantity of clustering results. The BF98 method employed a bootstrap-type procedure to determine the initial seeds for the centers. Several subsamples (recommended n = 10) of the data set were clustered using K-means. Each clustering operation produced a different candidate set of centroids from which a new data set was constructed. This data set was clustered using K-means, and the centroids were chosen as the initial seeds. The optimal BF98 clustering result on the Super deepSAGE data was obtained by “visualization” of the result performed by using K=10 and number of subsamples S=1000 after trying K from 5 to 28 and S from 20 to 1,000. The “visualization” method is straightforward for that deterring the best parameter for the K-means clustering procedure, but when the K reached 10, definite, compact and representative gene clustering was formulated, and when the S is higher than 200, consistent clustering result was produced for each duplicated clustering run.
Identification of over-represented motif for tissues specifically expressed transcripts
The CLOVER software [32] with JASPAR PWM database [33] was used to identify over-represented transcription factor binding motifs for each cluster of genes. The promoter regions for each cluster of transcript (1,000 bp upstream) were obtained using the Ensemble Biomart tool [34]. The promoter regions for the whole transcript detected in this project, which possesses similar GC content, were used as background. Motifs having a p-value of ≤ 0.05 were selected as significant (Table 2, top 5 motifs). The most significantly enriched motif in Class 1 is MZF1. TFAP2A and TFAP2C were also significantly enriched with a raw score higher than 30. In Class 2, there was only one significantly enriched motif, RHOXF1. In Class 3 and 4, there were five and four motifs with p-value < 0.05 respectively, but the raw score was lower than ten. In Class 5, there were at least five motifs with p-value < 0.05, and three of them, RUNX1, ASCL1, and Myod1 had a raw score higher than 30. In Class 6, the significantly enriched motifs with the highest score were SNAI2 and FIGLA, whereas, in Class 7, the significantly enriched motifs with the highest score was NR4A2. In Class 8, there was only one motif ZEB1 enriched in the promoter region of these transcripts. In Class 9, all the enriched motifs had a raw score of less than ten. In Class 10, the top three motifs were Ascl2, Myog, and Tcf12.
The transcription factors interact with the DNA recognition motifs, regulates transcription of a large number of genes, and play important roles in fundamental biological processes, including growth, development, and disease [35]. To understand gene expression regulation in the Super deepSAGE data obtained in this study, identifying the over-represented or under-represented motifs in the sequence showing similar expression pattern and which factors bind to them, is necessary. Over-representation indicated the motif candidates playing a regulatory role in the sequences, while under-representation indicated that the motif would have a harmful dis-regulatory effect. In each gene clusters showing a similar expression pattern, Clover successfully detects motifs known to function in the sequences and generate interesting and testable hypotheses.
Case report: confirmation of the regulatory roles of RUNX1 in PBMCs in pig
Confirmation of the RUNX1 binding site in the promoter region of TLR-2, LCK, and VAV1
The toll-like receptor 2 (TLR-2), lymphocyte-specific protein tyrosine kinase (LCK), and vav1 oncogene (VAV1) plasmid containing the 1Kb promoter sequence were used in in vivo studies (wild type). To show the regulation effect of RUNX1, the binding site of RUNX1 in TLR-2, LCK, and VAV1 was mutated or deleted. Reporter vectors constructed by the wild type, mutated, or deleted promoter sequences were transfected into the peripheral blood mononuclear cells (PBMCs), and luciferase activity was monitored. Binding site deletion significantly attenuated the expression of the downstream reporter luciferase activity (p<0.05), indicating that the RUNX1 could interact with the target site and regulate the expression of the downstream reporter gene (Fig. 6A-C). The mutated vectors showed significant attenuation of the activity of downstream luciferase at 40, 44, and 48 hours post-transfection (p<0.05) indicating a regulatory relationship between RUNX1 and the targets. Another experiment was performed using mouse macrophage cells (RAW 264.7) to validate the hypothesis further. Consistent with the previous results, deletion/mutations to the RUNX1 binding sites in TLR-2, LCK, and VAV1 promoter sequence significantly attenuated the activity of downstream luciferase at 40, 44, and 48 hours post-transfection (Fig. 6D-F). The luciferase reporter activity after transfection with the wild-type vector was significantly higher in macrophage cells than in the PBMC assays, suggesting that the endogenous RUNX1expression in mouse macrophage cells was higher than in PBMCs.
RNA flow cytometry analysis of RUNX1 targets in LPS and RUNX1 inhibitor treated PBMCs
To show the effect of RUNX1 on three targets; TLR2, LCK, and VAV1, pig PBMCs were stimulated with LPS and/or RUNX1 inhibitor, for 6 hours, during which their TLR2, LCK, VAV1, CD14 protein levels were monitored. Two subsets of cells readily emerged from CD14/TLR2 analysis in PBMCs: a CD14hi/TLR2lo (CD14high/TLR2low) and a CD14lo/TLR2lo population (Fig. 7D). The percentage of CD14hi/TLR2lo cells increased in LPS plus RUNX1 inhibitor treated samples, but the proportion of CD14lo/TLR2lo cells remained unchanged. The percentages of TLR2hi (for both CD14hi and CD14lo) cells increased seven-fold in LPS alone treated samples compared with the non-treated controls. Four subsets of cells readily emerged from CD14/LCK analysis in PBMCs treated with LPS or RUNX1 inhibitor: a CD14hi/LCKlo, CD14hi/LCKhi, CD14lo/LCKhi, and CD14lo/LCKlo population (Fig. 7E). The percentage of CD14hi/LCKhi, and CD14lo/LCKhi cells increased in LPS plus RUNX1 inhibitor treated samples, and the proportion of CD14hi/LCKlo cells was decreased. The percentages of CD14hi/LCKhi cells increased by 40% in LPS alone treated samples compared with the non-treated controls. Two subsets of cells readily emerged from CD14/VAV1 analysis in PBMCs: a CD14hi/VAV1lo and a CD14lo/VAV1lo population (Fig. 7F). The percentage of VAV1hi (for both CD14hi and CD14lo) cells increased four-fold in LPS plus RUNX1 inhibitor treated samples. The percentages of VAV1hi (for both CD14hi and CD14lo) cells increased seven-fold in LPS alone treated samples compared with the non-treated controls and is two-fold higher than in LPS plus RNUX1 inhibitor treated samples.
RUNX1 is a master regulator of hematopoiesis and plays a vital role in T and B cells development. RUNX1 is critical in inducing the production of genes in immune cells, such as interleukin-2 (IL-2, [36], IL-3 [37], colony-stimulating factor 1 receptor (CSF1R, [38], CSF2 [39], and cluster of differentiation 4 (CD4, [40]. However, its roles in LPS-mediated inflammation in PBMCs remains unclear. In this study, regulations of TLR-2, LCK, and VAV1 have been confirmed by flow cytometry. TLR2 is an essential receptor for the recognition of a variety of pathogen-associated molecular pattern (PAMPs) from Gram-positive bacteria, including bacterial lipoproteins, lipomannan, and lipoteichoic acids [41]. LCK encoded protein is a key signaling molecule in the selection and maturation of developing T-cells [42]. The VAV1 encoded protein is important in hematopoiesis, playing a role in both T-cell and B-cell development and activation [43, 44]. These results suggested that RUNX1 might be a new potential target for resolving inadequate or uncontrolled inflammation in PBMCs.
Real-time PCR analysis of RUNX1 targets in LPS and RUNX1 inhibitor treated PBMCs
To investigate if the expression patterns of the 23 RUNX1 target genes could be modeled by LPS and RUNX1 inhibitor treatment in vivo, we performed real-time PCR assay after treating PBMCs with two different doses of LPS (1 ng/mL, 10 ng/mL), and RUNX1 inhibitor (1 ng/mL, 10 ng/mL). Samples were collected six hours post-stimulation. A total of 21 genes were induced in response to at least one dose of LPS stimulation, as expression levels for these genes were different when compared to non-stimulated control. A total of 10 genes were down-regulated in response to the RUNX1 inhibitor treatment. Hierarchical clustering analysis was used to determine whether the response of LPS stimulation response was similar to the patterns detected in RUNX1 inhibitor treatment, and if any differences were observed depending on the dosage of LPS and RUNX1 inhibitor used. As shown in Fig. 8, the expression patterns of samples with RUNX1 inhibitor treatment, RUNX1 inhibitor plus LPS treatment, and non-simulated controls clustered together. Different dose of the RUNX1 inhibitor did not affect the samples, as observed by the mixing up of respective samples in the heatmap. The LPS treated samples were unique and were separated from the RUNX1 inhibitor-treated groups and control groups. Similar to the RUNX1 inhibitor, different doses of LPS dose did not affect the samples as well. The expression patterns of RUNX1 inhibitor plus LPS treatment samples were similar with controls and samples treated with RUNX1 alone because they were mixed in the heatmap.
Super deepSAGE is a useful data resource in pig study
Gene expression analysis is extensively applied in the understanding of the molecular mechanisms underlying a wide range of biological process such as host-pathogen interactions. Our dataset of transcript levels in normal tissues was developed as a reference datasets that can be compared to attained information of biological event specifically related aberrations in transcript levels. Therefore, one major focus of this manuscript was to demonstrate the biological importance of these profiles. We report that >40% of the measured transcripts were differentially expressed between the different tissues. We show that statically the transcripts were co-regulated by a few important transcription factors. We describe one of the many transcription factors that regulated gene expression in PBMCs. To our knowledge, this data set is the largest to date for the analysis of transcriptional profiles within normal tissues from pigs and is complementary to previously published data sets. These data will improve the annotation of the pig genome, support versatile biological research, and increase the utility of the pig as a meat source animal and model in medical research.