Bovine, Porcine and Feline Tritrichomonas foetus transcriptomics overview
As we have previously mentioned, genetic differences between T. foetus strains are consistent, but not sufficient to define as different species the bovine, porcine and feline strains. Moreover, those differences are not enough for understanding the T. foetus adaptive capacity to different hosts. In this work, we hypothesized that the ability of T. foetus to adapt to different hosts could be explained at transcriptional regulation level. In this sense, we conducted a mapping experiment employing three available transcriptomics datasets for T. foetus strains: BP-4 (bovine), PIG30/1(porcine), G10/1 (feline) (13, 14); against the available public genome for T. foetus (bovine K1 strain (15). Our results demonstrated a high mappability rate of the three strains against T. foetus K1 strain (Table 1) and thus, a close relationship among the different strains at transcriptomic level with the bovine K1 reference strain.
Next, we performed a guided transcriptome assembly (K1 strain as reference) and we reconstructed a common transcriptome for each one of T. foetus strains analyzed (BP-4, G10/1 and PIG30/1). We obtained an assembly of 29,361 isoforms (contigs), which represents a total of 26,284 genes for the three strains. Taking into account that we used as reference the K1 assembly, differences between previously de novo assemblies are clear (Table 1). Finally, we highlight that the T. foetus K1 assembly showed 25,336 genes, and we were able to obtain 948 sequences assemblies that could be new genes sequences. Afterwards, by a Principal Component Analysis (PCA), we demonstrated that BP-4 and PIG30/1 strains were grouped and distant from G10/1 strain; concluding that similarities between bovine and porcine strains, at the gene expression level, are high and not related to feline strain gene expression (Fig. 1). In this context, expression data for each gene were plotted for each strain against the other, which exhibited clear differences between expressing genes for the different strains (Additional file 2:Figure S2A). Once again a close relationship between BP-4 and PIG30/1 strains arises, while a marked difference in gene expression level between G10/1 and their counterparts is evident.
Table 1
Statistics from the Tritrichomonas foetus transcriptomes
|
K1(ref.)
|
PIG30/1a
|
BP-4b
|
G10/1b
|
Assembly size (nt)
|
51862250
|
47094268
|
37882427
|
29525551
|
Contigs (n°)
|
29361
|
43308
|
42363
|
36559
|
Largest
|
21237
|
17203
|
14314
|
17195
|
Shortest
|
71
|
201
|
201
|
201
|
Average
|
1744.38
|
1087
|
895.25
|
806.61
|
N50
|
2454
|
1503
|
1259
|
1178
|
Map vs K1(%)
|
-
|
95.68
|
96.65
|
91.36
|
K1(ref.): guided assembly performed in this work using Tritrichomonas foetus K1 genome as reference. aSummary of transcriptome statics from Morin-Adeline et al.; 2015 (14), bSummary of transcriptome statics from Morin-Adeline et al.; 2014 (13). |
Expression patterns in T. foetus strains transcriptomics
Considering the difficulty of analysis of the great amount of data obtained, we hypothesized that reducing the dimension of the data could help us to explore differences in a search of gene expression patterns that could be related with each strain in a different context. Since high throughput technologies have generated large quantities of data available in recent years, determining expression patterns from the resulting datasets could be problematic without a previous dimension reduction procedure. In this sense, computational clustering methods are used to improve exploration strategies by reducing the complexity of the data sets (31). Moreover, gene expression clusters are composed of similar function categories, a particularity that could be exploited to infer the functionality of uncharacterized genes that are part of the same cluster (32).
In order to reveal the patterns of gene expression for each T. foetus strains, we performed a hierarchical clustering procedure that let us reduce our data from 26,284 genes to 690 clusters of genes for each strain. A heatmap plot splitted by dendrogram in 5 groups allowed us to demonstrate clear differential patterns. In concordance with previous data, PIG30/1 and BP-4 strains appear grouped and separate from G10/1 strain (Fig. 2, cluster composition can be found in Additional file 4:Table S3). As can be seen, characteristic patterns of highly expressed clusters of genes arise from G10/1 strain at clusters B, C and D. On the other hand, cluster patterns for PIG30/1 and BP-4 strains were part of section A and B. Is possible to see that patterns in groups B, C and D appeared to be the most particular. In fact, while gene clusters in section B appeared to be greatly expressed in the three strains (with some particular differences), gene clusters in section C and D were more expressed in G10/1 strain (Fig. 2 and Additional file 2:Figure S3). In concordance with those observations, we explored the three sections with the aim of describing the gene composition of each one.
A closer inspection over section B revealed the presence of a gene product related to cathepsin L-like cysteine peptidase in cluster 498 (Fig. 2). This gene product is homologous to TfCP8 protein of T.foetus F2 strain (gene bank: X87781.1) (33). Additionally, other genes products related to proteases were present, such as TRFO_22235 at cluster 520, TRFO_09351 in cluster 189, TRFO_43126, and TRFO_29624 (both at cluster 389; see Additional file 4:Table S3). In cluster 316 we found the TRFO_05369, another putative cysteine protease; and in this cluster, also we observed two putative malate dehydrogenase enzymes, a 40S ribosomal protein, and a hydrogenosomes membrane protein precursor. In addition, we have been able to identify a gene TRFO_27838 that codifies a putative precursor of adhesin AP65-1 in cluster 66 and another putative adhesin (TRFO_07867) in cluster 619. This type of adhesins possesses a role as moonlighting proteins (a subclass of multifunctional proteins previously documented in trichomonads parasites with a role in pathogenesis and adaptation to host cells) (34). Interestingly, at cluster 140, a gene that encodes for a tetraspanin protein (TRFO_34204) was found grouped with ribosomal proteins and rubrerythrin gene, a protein related to oxidative stress protection (35).
Section C showed clusters formed of genes related to pathogenesis and moonlighting proteins. We identified genes associated with homeostasis regulation (thioredoxin, HSP90), amino acid metabolic process (aminotransferases, aminopeptidases, PDXDC1, protein phosphatase 2C1, protein serine/threonine phosphatase, cysteine synthase), and nitroreductase family genes (related to nucleotide metabolism). As well as other genes involved in Ca2+ signaling machinery (EF-hand family protein and CBL-interacting serine/threonine-protein kinase 8), synthesis of phospholipids (myosin cross-reactive antigen, inositol 3 phosphate synthase) and the GDP-L-fucose biosynthesis via de novo pathway (GDP mannose 4,6 dehydratase). Finally, we demonstrated the presence in cluster 592 (Fig. 2) of a gene that codify for a Clan CA, family C1, cathepsin L-like cysteine peptidase homologous to the documented cysteine protease gene from F2 strain (CP7, genbank:X87780.1) (33).
In section D, as in section B, we detected genes associated with oxidative stress defense such as superoxide dismutase, rubrerythrin, thioredoxin, peroxiredoxin, pirin, OsmC (osmotically inducible protein C), and quercetin 2,3-dioxygenase. As well as, NifU-like domains containing genes and genes related to homeostasis regulation (HSP90-2 and HSP71). Also, we identified genes related to energy metabolism (malate dehydrogenase, glucose 6 phosphate 1 dehydrogenase, NADP dependent isopropanol dehydrogenase, pyruvate decarboxylase isozyme 3/, glyceraldehyde 3 phosphate dehydrogenase, alcohol dehydrogenase iron-containing family protein, glutamate decarboxylase, glucokinase 1, transketolase family protein and fumarate hydratase class II) and lipid transport (Lipid A export ATP binding/permease protein MsbA). In these sections, genes related to signalling (CaMK, ABC transporter family protein, MFS or major facilitator superfamily transporter, small GTP-binding protein; V type proton ATPase subunit B and V-type proton ATPase 16 kDa proteolipid subunit, Ubiquitin-conjugating enzyme E2 and a saposin-like gene) were also identified.
We were also able to detect genes associated with regulation of enzyme activities and/or gene expression (Adenylate and Guanylate cyclase), regulation of transcription (Myb like DNA-binding domain-containing gene), and translation (initiation factor eIF-5A gene family). Finally, ribosomal genes (40S ribosomal S17-B, 40S ribosomal S27, 60S ribosomal L19, 60S ribosomal export gene NMD3), cell division cycle gene 48; and proteases such as a cysteine proteinases (CPs): Clan CD, family C14, metacaspase like cysteine peptidase, Clan CD, family C13, asparaginyl endopeptidase-like cysteine peptidase were also grouped at this section. In conclusion, our clustering analysis was able to reduce the great dimensionality of the transcriptomic data and let us find clusters of genes with potential roles in adaptation and pathogenesis, in addition was possible to observe characteristic patterns of gene cluster expression for each strain.
Cysteine Proteases (CPs) expression patterns in T. foetus strains
In trichomonads, cysteine proteases have been implicated in the adherence to host cells, cytotoxicity, nutrient acquisition and the evasion of the host immune response (36). Additionally, it has been reported that such proteases are differentially expressed among T. foetus strains. Morin-Adelin et al (14) documented that cysteine protease 7 (CP7) was more expressed in the G10/1 strain, in contrast to cysteine protease 8 (CP8), which is more expressed in BP4 and PIG30/1 strains. Taking into account this observation, we performed a homology search in our transcriptome assembly with the aim of finding all the possible genes related to the curated protease database MEROPS (29). A blastx analysis gave us a total of 520 proteases (Additional file 5: Table S4), then we mapped these results against our clustering analysis in order to find clusters that contained those proteases, and thereby confirm their expression in the T. foetus strains.
The gene proteases were distributed in 194 clusters of the 690 in our agglomerative analysis (28%), which tells us that a great percent of gene clusters that contain almost one gene that codifies for a putative protease; demonstrating the great distribution of these proteins in the T. foetus genome. Particularly, we searched for new proteases that could belong to the CPs family, and our blastx results demonstrated the existence of a total of 265 putative CPs (~45%); of them, 132 correspond to hypothetical proteins (~50%) in the K1 reference genome. Additionally, we obtained five new sequences related to CPs as a result of our assembly method applied that were not present in K1 genome reference (Additional file 5: Table S4, genes with only Assembly id). This family of proteases (CPs) was represented in 130 clusters of the 192 clusters where proteases were found (~68%). This observation showed the great representation of CPs in the total of proteases expressed by T. foetus, in the three strains. In addition, it is important to mention the existence of a great variety of genes in most clusters where the CPs were present, and as has been largely documented, genes in the same cluster could be co-expressed and participate in similar processes (31).
In this context, studying particular differences between strains at clusters composed by CPs could give us more information about adaptation to the host and the pathogenesis process. As shown in Fig. 3A patterns of CPs clusters arise, revealing differences between strains. One of those differences has been documented and involves a cathepsin L-like cysteine peptidase homologous to TfCP7 from F2 strain (genbank: X87780.1; TRFO_22235; cluster 592) and TfCP8 (TRFO_20864, cluster 498), both previously mentioned in this work.
Then, we searched for clusters with similar behavior to TfCP7 (592) and TfCP8 (498) at each strain to propose new gene targets related to these pathogenic factors and the processes where those could participate. We highlighted five clusters related to TfCP7(592) behavior (Fig.3B): cluster 228 that contains a cathepsin L-like endopeptidase gene termed “crustapain” (TRFO_20787) in addition to six more genes, two of them termed as hypothetical proteins and one new gene product as result of our assembly. It highlights the presence of a gene that encodes the alpha amylase, a CAMK kinase protein and GTPase activator protein (Additional file 4:Table S3). Cluster 315 contains 21 genes including a pro-cathepsin H (TRFO_30800) and a Clan CD, family C14, metacaspase-like cysteine peptidase (TRFO_39208) with a CAMK family protein. A numerous genes were termed as hypothetical, while the rest were related to GTPase activity (Additional file 4:Table S3). Cluster 345 contained 4 genes, including a ubiquitin-specific peptidase 21 orthologue (TRFO_42709), the cluster 471 was composed by 4 genes including a Clan CD, family C14, metacaspase-like cysteine peptidase (TRFO_19395), and finally, the cluster 545 included seven genes including an asparagine endopeptidase of the family C13 (Legumain;TRFO_32118). Once again a CAMK family protein stands out sharing the cluster with CPs.
On the other hand, five clusters are highlighted as related to CP8 (498) behavior (Fig.3C), those were: cluster 144, composed of six genes including cathepsin K (TRFO_42629) and crustapain (TRFO_02020) that was confirmed by us as the homologous to CP5 from T. foetus F2 strain (Genbank: X87778.1) (33). Cluster 280 included 14 genes such as cathepsin L-like cysteine peptidase annotated as “cysteine protease 8” (TRFO_16156) homologous to TfCP6 from F2 strain (X87779.1), in addition to an EF-hand protein and CAMK family protein. Cluster 521 with three genes, including a papain homolog annotated as “Oryzain alpha chain” (TRFO_07604), cluster 602 was composed only by a Clan CA, family C1, cathepsin L-like cysteine peptidase (TRFO_30793) as the cluster 625 that only included a cathepsin L1 (TRFO_08265), homologous to the TfCP9 from F2 strain (X87782.1). As can be seen CPs, shared clusters with a great variety of genes that are co-expressed. While a great proportion was termed in reference as hypothetical protein, others are related to CAMK family and EF-hand proteins, those gene products were documented as important factors for pathogeny and migration in other parasite models (37). In this sense our in-depth analysis is useful for studying large families of genes and proposing new gene targets related to pathogenesis.
Functional analysis of differentially expressed genes (DEGs)
Since our exploratory analysis revealed characteristic patterns of gene expression for each strain, we decided to perform a more strict analysis to investigate differences at the gene expression level between strains to map those differences to biological processes or molecular functions that could explain strains adaptability.
To achieve the above objective we performed a differential expression (DE) analysis between strains. The analysis was conducted as follows: G10/1 vs PIG30/1; BP-4 vs PIG30/1 and BP4 vs G10/1, significant DEGs were listed in Additional file 1:Table S1. Volcano plots (Additional file:Figure S2B) summarize the results of the analysis, as can see for G10/1 vs PIG30/1 we observed 157 genes upregulated for G10/1 strain (downregulated for PIG30/1 strain) while another 445 genes were downregulated (upregulated for PIG30/1 strain). For BP-4 vs PIG30/1 analysis, we observed 52 genes upregulated for BP-4 strain (downregulated for PIG30/1) and 79 genes downregulated (upregulated for PIG30/1). Finally for BP-4 vs G10/1 367 were upregulated for BP-4 strain (downregulated for G10/1) and 203 were downregulated (upregulated for G10/1 strain). Initially, we observed that the major differential expressed genes (DEGs) arise when G10/1 strain was compared against the PIG30/1 or BP-4 counterpart, while in BP-4 vs PIG30/1 analysis the number of DEGs was much lower (Additional file:Figure S2B).
Since we conducted a guided transcriptome assembly using as reference Tritrichomonas foetus K1 draft genome, gene annotations were taken from the reference. In this context, genes that were not present in T. foetus K1 annotation were documented in our assembly (see Additional file 4:Table S3, genes with only Assembly_id). Taking into account that T. foetus K1 strain genome remains poorly annotated, we implemented a strategy for to annotate most quantity of genes present in our assembly (and not present in T. foetus K1 assembly) and assigned them putative functions by hidden Markov models (see material and methods). As a result, possible functions to genes that were part of the DEGs between strains were assigned (Additional file 6: Table S5). Next, we map the gene ontology terms related to the annotated DEGs. As can be seen in Fig. 4, at the three comparisons, biological processes were common between up and downregulated genes. A major percent of DEGs were associated to protein phosphorylation processes (GO: 0006468) and we observed that these proteins were related to kinase domains (PF00069) and tyrosine and serine/threonine kinase domains (PF07714). It is important to highlight that even though this process is the most represented in the three comparisons, the genes that are related to this process are not the same, suggesting that each strain uses different groups of kinases at the different contexts (Fig. 5A).
An inspection over G10/1 vs PIG30/1 comparison revealed that chromatin remodeling process (GO:0006338) was presented in upregulated and downregulated genes, particularly by those which share sequences related with a c-Myb DNA binding domain. Moreover, carbohydrate metabolic (GO: 0005975) and glucose metabolic (GO: 0006006) processes were more represented in upregulated genes. In the downregulated genes process were more represented those involved in the regulation of transcription (GO: 0006355), regulation of DNA replication (GO: 0006275) and DNA replication (GO: 0006260) processes. Molecular functions shared between upregulated and downregulated genes were related to kinase activity (GO: 0004672), GTP binding (GO: 0005525), and GTPase activity (GO: 0003924). Only in upregulated group, we observed genes related to calcium ion binding (GO:0005509), and those related to EF-hand domain-containing proteins (PF00036), a Ca2+ ion binding domain essential for cell signaling (10.1042/BJ20070255). Finally, protein binding function (GO:0005515) appears represented only in downregulated genes due to the presence of genes that codify for Leucine rich repeat proteins (PF13855), that are known to participate in protein-protein interaction (38).
For BP-4 vs PIG30/1 DE analysis results showed that genes that are up and down regulated were involved in the carbohydrate metabolic process, this is due to the differential expression of genes that codify for Glycosyl hydrolases (PF04616, PF01055) related to molecular function of hydrolase activity, hydrolyzing O-glycosyl compounds (GO:0004553), and a glucose-6-phosphate 1-dehydrogenase family protein (PF02781) that participate in addition at glucose metabolic process (GO:0006006) and oxidation-reduction process (GO:0055114). Transmembrane transport (GO: 0055085) was an upregulated process, since the differential expression of a gene that codifies for a major facilitator superfamily protein or MFS (PF07690). Interestingly, MFS proteins facilitate the transport across the cytoplasmic or internal membranes of a variety of substrates including ions, sugar phosphates, drugs, neurotransmitters, nucleosides, amino acids, and peptides (39). On the other hand, translational initiation (GO:0006413), chromatin remodeling and DNA replication were downregulated processes showing that those processes, related to growth and cell replication, were more active in the PIG30/1 strain. Molecular functions like GTP binding and GTPase activity were upregulated in contrast to nucleic acid binding (GO: 0003676) and hydrolase activity (GO: 0004553) that were downregulated processes.
Finally, BP-4 vs G10/1 analysis revealed common processes between up and downregulated genes such as chromatin remodeling and regulation of transcription DNA-templated (GO: 0006355). Cell redox-homeostasis (GO: 0045454) was an upregulated process since thioredoxin genes were found upregulated. Also, regulation of DNA replication was upregulated, since proliferating cell nuclear antigen (PCNA) genes were differentially expressed. Moreover, the signal transduction (GO: 0035556) and carbohydrate metabolic process were downregulated processes. An interesting observation, as was shown above for DE kinases, was that the genes related to the chromatin remodeling process (GO: 0006338) were not the same in each comparison, opening questions about gene regulation at different contexts for T. foetus strains (Fig. 5B).
Next, all the annotated DEGs were mapped in our clustering analysis with the aim of extending our annotation procedure. We observed that a great proportion of DEGs (at the three comparisons) were included in clusters of section E in our heatmap representation, with poor representation of clusters that are part of sections A,B and D, and none of the genes from clusters of section C were differentially expressed (no significant). Cluster 134 (section E), with 43 genes, was the most represented cluster in G10/1 vs PIG30/1 comparison when upregulated genes are explored. The majority of the DEGs were involved in processes such as chromatin remodeling and phosphorylation, and also a papain family cysteine protease (TRFO_23170) was highlighted. This was the only CP that was significantly expressed between strains in our analysis and is the homologue to the CP1 from T. foetus D1 strain (U13153.1) (40). It is important to highlight that, in cluster 134, our analysis grouped three genes related to CPs, six genes related to Myb proteins and eleven genes related to proteins that participate in phosphorylation processes (Additional file 4:Table S3). This observation is consistent with our hypothesis that these three groups of genes participate in a common process for adaptation in a parasite's environment: protein kinases integrating signals from context, Myb proteins regulating the gene expression for host colonization and CPs that participates in virulence and adherence to host.
On downregulated genes, cluster 21 (Section E) was the most represented with 89 genes, and were included particular process as cell redox homeostasis (GO: 0045454), intraciliary transport (GO: 0042073), vacuolar transport and regulation of autophagy (GO: 0010506), although phosphorylation process (GO: 0006468) and GTPase activity (GO: 0003924) function were the most represented. Once again, genes related to Myb proteins, protein phosphorylation and CPs are highlighted.
For BP-4 vs PIG30/1 upregulated genes analysis, cluster 124 (Section E), contributed with 18 genes related to phosphorylation processes (GO: 0006468) and with kinase activity (GO: 0004672), while cluster 494 (Section E), in downregulated genes, was the most represented with 16 genes. Finally, cluster 21 was the most represented (59 genes) at upregulated genes for BP-4 vs G10/1 comparison, while for downregulated genes cluster 154 (section E,37 genes) was highlighted with different process such as calcium binding (GO:0005509), regulation of transcription (GO:0006355), transmembrane transport (GO:0055085), vesicle-mediated transport (GO:0016192) and phosphorylation (GO:0006468). Interestingly, the clusters described above are composed of a great proportion of genes that codify for hypothetical proteins (i.e. for cluster 134, ~50%), in this sense our analysis could contribute to functional annotation of genes.