1. Sequence divergence and genomic organization of cruzipain genes in different T. cruzi strains
Genomic analyses of the T. cruzi CL Brener (DTU-TcVI), Dm28c (DTU-TcI), and YC6 (DTU-TcII) strains based on PacBio and Nanopore sequencing, together with the current Sanger-based assembly of the CL Brener strain available at TriTrypDB.org, revealed two regions on the same chromosome containing cruzipain gene clusters in the genomes of all three parasite strains. For the CL Brener strain, which has a hybrid genome, we show the location of the two clusters in both separately assembled haplotypes (Esmeraldo-like, EL, and non-Esmeraldo-like, NEL) (Figure 1). Although the flanking genes were highly syntenic, we also detected substantial copy number variation within each cluster among the genomes (Figure 1). Notably, copy number also varied between the two CL Brener haplotypes. According to the assembly available in TriTrypDB, the two clusters are located on chromosome 6 of the CL Brener strain and on chromosome 35 of the YC6 strain (Figure 1). The number of gene copies in each cluster is given in Table 1.
Table 1. Number of cruzipain gene copies per cluster in Dm28c, CL Brener, and YC6 T. cruzi genomes.

| Strain/haplotype | Cluster I, complete copies | Cluster I, truncated copies | Cluster II, complete copies | Cluster II, truncated copies | Cluster III, complete copies | Cluster III, truncated copies |
|---|---|---|---|---|---|---|
| Dm28c | 14 | 1 | 7 | 3 | - | - |
| CL Brener EL | 5 | 5 | 2 | 3 | - | - |
| CL Brener NEL | 10 | 8 | 3 | 8 | - | - |
| YC6 | 14 | 4 | 6 | 2 | 6 | 1 |
In the genomes of all three strains, the first cluster (Cluster I) is flanked by the genes encoding “cysteinyl-tRNA synthetase” and “hydroxyacylglutathione hydrolase”. In the Dm28c genome, Cluster I contains fourteen nearly identical complete cruzipain copies in tandem (>99% protein sequence identity) and one truncated copy of the gene (Table 1). In CL Brener, Cluster I contains five complete tandem copies and five truncated copies in the EL haplotype, and ten complete copies and eight truncated copies in the NEL haplotype. As observed for Dm28c, all cruzipain sequences in Cluster I of both CL Brener haplotypes share >99% predicted protein sequence identity (Supplementary Table S1). Cluster I in the YC6 genome contains fourteen complete and four truncated cruzipain copies; the complete copies are almost identical, with >99% protein sequence identity (Supplementary Table S1).
In all three strains, the second cluster (Cluster II) is flanked by the “UDP-Gal glycosyltransferase” gene and the “pumilio/PUF RNA binding Protein 4” gene. In Dm28c, Cluster II contains nine copies in tandem. Seven of them encode complete cruzipain sequences sharing >91% protein identity, and <89% protein identity with the sequences of Cluster I, while the remaining two are truncated (Figure 1, Supplementary Table S1). In Dm28c, a tenth copy, a single truncated gene, is located further downstream of Cluster II. In the CL Brener EL haplotype, Cluster II has five copies of cruzipain, whereas the NEL haplotype has eleven. Remarkably, only two copies in the EL haplotype and three in the NEL haplotype encode full-length protein; Cluster II of CL Brener therefore potentially encodes five functional cruzipains (Table 1, Supplementary Fig. S1). In YC6, Cluster II has six complete cruzipain copies with >99% protein sequence identity, as well as two truncated copies.
Notably, the YC6 strain has a third cluster containing six complete cruzipain copies, flanked by two “Hypothetical ORFs”, a “UDP-Gal glycosyltransferase” gene, and a “trans-sialidase” gene, which is located to the left of the other two clusters. Sequences in Cluster III share >96% protein sequence identity. This cluster also contains one truncated copy that is more similar to the cruzipain sequences found in Cluster II (Figure 1, Table 1, and Supplementary Table S1).
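The copy numbers reported in Table 1 can be tallied per strain or haplotype with a few lines of Python. The sketch below is only an illustration: the dictionary `TABLE_1` and the helper `totals` are hypothetical names introduced here, and the values are transcribed by hand from Table 1.

```python
# Copy numbers transcribed from Table 1 as (complete, truncated) per cluster.
# TABLE_1 and totals() are illustrative names, not part of the original analysis.
TABLE_1 = {
    "Dm28c":         {"I": (14, 1), "II": (7, 3)},
    "CL Brener EL":  {"I": (5, 5),  "II": (2, 3)},
    "CL Brener NEL": {"I": (10, 8), "II": (3, 8)},
    "YC6":           {"I": (14, 4), "II": (6, 2), "III": (6, 1)},
}


def totals(strain):
    """Return (complete, truncated) copies summed over all clusters of a strain."""
    clusters = list(TABLE_1[strain].values())
    complete = sum(c for c, _ in clusters)
    truncated = sum(t for _, t in clusters)
    return complete, truncated


for strain in TABLE_1:
    complete, truncated = totals(strain)
    print(f"{strain}: {complete} complete, {truncated} truncated")

# Full-length Cluster II copies across both CL Brener haplotypes.
cl_brener_cluster_ii = (
    TABLE_1["CL Brener EL"]["II"][0] + TABLE_1["CL Brener NEL"]["II"][0]
)
print("CL Brener Cluster II complete copies:", cl_brener_cluster_ii)
```

Summing the complete Cluster II copies of the EL and NEL haplotypes reproduces the five potentially functional Cluster II cruzipains mentioned for CL Brener.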
Analysis of the truncated cruzipain gene sequences from the three strains revealed a wide variety of frameshifts that led to truncations in different domains of the protein (Supplementary Fig. S1). Full-length cruzipain consists of a signal peptide (pre-region), a prodomain that is required for protein folding and is released from the zymogen upon enzyme maturation/activation, the central/catalytic domain that is responsible for enzymatic activity, and a C-terminal extension [2] (Supplementary Fig. S1 and Figure 3A). In each strain, one gene copy in Cluster I is truncated near the end of the central (catalytic) domain (Val213), which could result in the expression of a functional enzyme lacking the C-terminal extension. All other truncated gene copies, in the three parasite strains, lack significant portions of the gene sequence, and the resulting products should not have protease activity. However, the great majority of truncated cruzipain genes, at least in the Dm28c and YC6 strains, retain an intact signal peptide and prodomain. These genes could therefore produce proteins that lack enzymatic activity but retain a functional prodomain, which acts as an inhibitor of cruzipain, as described [25].
The intergenic regions between cruzipain copies are highly similar within a cluster, sharing 95-100% nucleotide identity within the tandem repeats of Cluster I when the three strains are compared (Supplementary Table S2, Supplementary Fig. S2). Likewise, intergenic regions from YC6 strain Cluster III are very similar to Cluster I sequences. The intergenic regions within the tandem copies of Cluster II of the three strains are also highly similar, sharing >93% identity. Notably, the intergenic regions of Cluster II are 62 nucleotides shorter than those of Cluster I, due to a deletion in the middle of the intergenic sequence. Interestingly, the deletion in Cluster II is observed at the same location in all three parasite strains (Supplementary Fig. S2).
2. Cruzipain sequences can be grouped into two families and four sub-types
To evaluate the variability within cruzipain sequences, we aligned the predicted protein sequences of all full-length copies (sequences with the prodomain, central domain, and C-terminal extension) (Supplementary Table S1) and also aligned each domain separately (Supplementary Tables S3-S5). This analysis revealed that, while the regions corresponding to the preprodomain (signal peptide and prodomain, residues -122 to -1) and the C-terminal extension (residues 216 to 345) are highly conserved (>93% and >90% sequence identity, respectively), the central domain (residues 1 to 215) displays a higher degree of divergence (74-100% identity) (Supplementary Fig. S3 and Supplementary Tables S3-S5). As described previously, the C-terminal extension is not required for enzymatic activity, while the central domain possesses all the residues required for proteolytic activity [2].
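As an illustration of the per-domain comparison, the sketch below slices two sequences by the residue ranges given above and computes percent identity for each domain. It is a minimal sketch under stated assumptions, not the alignment procedure used here: it assumes two gap-free full-length sequences of equal length (467 residues, i.e. 122 preprodomain residues plus 345 mature residues) numbered with no residue 0, and the names `DOMAINS`, `to_index`, `domain`, and `percent_identity` are hypothetical. The two toy sequences are synthetic placeholders.

```python
# Domain boundaries in mature-enzyme numbering, as given in the text.
DOMAINS = {
    "preprodomain": (-122, -1),
    "central": (1, 215),
    "c_terminal": (216, 345),
}


def to_index(residue_number):
    """Map mature-enzyme numbering (which has no residue 0) to a 0-based index."""
    if residue_number == 0:
        raise ValueError("mature numbering has no residue 0")
    if residue_number < 0:
        return residue_number + 122
    return residue_number + 121


def domain(seq, name):
    """Return the part of a gap-free full-length sequence covering one domain."""
    start, end = DOMAINS[name]
    return seq[to_index(start): to_index(end) + 1]


def percent_identity(a, b):
    """Percent of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Synthetic placeholder sequences: identical except at positions 67 and 68.
reference = "A" * 467
variant = list(reference)
variant[to_index(67)] = "S"
variant[to_index(68)] = "S"
variant = "".join(variant)

for name in DOMAINS:
    identity = percent_identity(domain(reference, name), domain(variant, name))
    print(f"{name}: {identity:.1f}% identity")
```

Real cruzipain copies would first need to be aligned, since insertions or deletions would shift the simple index mapping assumed here.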
Phylogenetic analyses based solely on the central domain of all protein sequences indicated that cruzipain genes could be grouped into two major families that correlate with their distribution in each cluster (Figure 2). Sequences from Cluster I are highly similar (>97% identity) to cruzipain 1/cruzain [2], with the CL Brener EL sequences grouping with YC6 sequences and the CL Brener NEL sequences grouping with those of Dm28c. Sequences from Cluster II are more similar to the previously described cruzipain 2 isoform [16], with the same grouping pattern. Based on these analyses, we propose the classification of cruzipain sequences into two families: Family I, comprising the sequences from gene Cluster I, and Family II, comprising the sequences from gene Cluster II. Interestingly, in the YC6 strain, the third cluster contains copies of Family I and one truncated copy of Family II (indicated by the arrows in Figure 2). The Family I sequences from Cluster III group with the other Cluster I sequences.
The czp1 central domain sequences present in the two CL Brener haplotypes, EL and NEL, share >98% protein sequence identity. When czp1 sequences were compared between strains, a high level of conservation was maintained, i.e., 97% identity between czp1 from Dm28c and CL Brener, and 93% identity between the Dm28c and YC6 strains (Supplementary Table S5).
In contrast to the high sequence identity observed within Family I, greater divergence was found among the cruzipain sequences of Family II. The sequence previously described as the czp2 isoform was identified in Cluster II of Dm28c [16,17], as well as sequences that diverged further from czp1 and czp2. Considering this high level of diversity, we asked whether cruzipain sequences of Family II could be further divided into sub-types according to residue substitutions in the catalytic domain. Alignment of cruzipain DNA sequences of both families from Dm28c identified twenty-three SNPs (data not shown) and twelve positions with substitutions of two or more sequential amino acid residues in the catalytic domain (Supplementary Fig. S3). A larger number of amino acid substitutions was identified in the catalytic domain of cruzipains belonging to Family II. Furthermore, cruzipain copies of the different strains could readily be grouped according to the residues displayed at these divergent positions. Figure 3 compares these sequences, with particular emphasis on motifs that are important for enzyme activity, as previously defined for papain-like cysteine proteases and based on structural data of cruzain co-crystallized with ligands [26–28].
We further analyzed the amino acid substitutions in the central/catalytic domain (Ala1 to Gly215) among cruzipain sequences (Figures 3A and 3B). Based on their structural importance, we initially considered loops containing residues 67-70 and 158-162, located at the interfaces between protease subsites S2/S3 and S2/S1’ (Figure 3C). We then refined the positions within these motifs that could define sub-type signatures, based on residues that frequently interact with cruzain inhibitors in crystallographic complexes (described in further detail in section 3 and Table 2). Considering frequent protein-ligand interactions, the following positions were determined to be signatures for cruzipain sub-types: 67, 68, 138, and 208, located in the S2 pocket; 145, in the S1’ region; and 159 and 161, at the interface between the S2/S1’ sites. Importantly, residue 208 is crucial in conferring dual cathepsin L-like and cathepsin B-like specificity to czp1 [29,30], while non-conservative substitutions in residues 68-70 were previously hypothesized to determine the different S2 specificity of cruzipain 2 [17,18]. Based on the comparison of these positions, we propose the division of cruzipain Family II into three sub-types: cruzipain 2, cruzipain 3, and cruzipain 4 (Figure 3C). These signatures are found in multiple T. cruzi strains, namely CL Brener, Dm28c, YC6, and TCC (Supplementary Figures S3 and S4). The modifications are consistently observed in the same regions in all the sequences, with each sub-type having its own conserved pattern. Because random modifications are rare, the same modification is usually found in a given sub-type across all analyzed strains.
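The signature positions listed above lend themselves to a simple lookup. The sketch below is a hypothetical illustration rather than the classification procedure used in this work: it matches the residues at positions 67, 68, 138, 208, 145, 159, and 161 against the patterns of the representative sequences in Table 2. The names `SIGNATURE_POSITIONS`, `SIGNATURES`, and `classify` are assumptions, and exact matching is an assumption too, since other copies of a sub-type may deviate from the modeled representatives.

```python
# Signature positions named in the text (mature-enzyme numbering).
SIGNATURE_POSITIONS = (67, 68, 138, 208, 145, 159, 161)

# Residues at those positions, transcribed from the representatives in Table 2.
# Cruzipain 3 has two entries because its two modeled sequences differ.
SIGNATURES = {
    "LMAEMQD": "cruzipain 1",
    "SSAEMRN": "cruzipain 2",
    "WPTEITS": "cruzipain 3",  # czp.3.II.4_EL
    "WPAEIQS": "cruzipain 3",  # czp.3.II.9_NEL
    "SPAGMKD": "cruzipain 4",
}


def classify(residues):
    """Return the sub-type whose signature exactly matches the given residues.

    `residues` maps a position number to a one-letter amino acid code.
    """
    key = "".join(residues[pos] for pos in SIGNATURE_POSITIONS)
    return SIGNATURES.get(key, "unclassified")


example = {67: "S", 68: "P", 138: "A", 208: "G", 145: "M", 159: "K", 161: "D"}
print(classify(example))
```

A truncated copy that ends before one of these positions cannot be given a key at all, which mirrors why some truncated copies cannot be assigned to a sub-type.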
The distribution of the gene copies belonging to each proposed sub-type in the clusters of the three strains analyzed is depicted in Figure 4. Cluster I/Family I is flanked by the same genes in Dm28c, YC6, and both CL Brener haplotypes. In contrast, Cluster II/Family II is more heterogeneous among the strains. While the Dm28c and YC6 strains show a cluster with all cruzipain copies in tandem, with one truncated copy further downstream in Dm28c, both haplotypes of CL Brener show a larger number of truncated copies, as well as single truncated copies spread nearby in the chromosome (white rectangles in Figure 4). A detailed description of the truncated copies is given in Supplementary Fig. S1. We could not classify the sub-types of most of the truncated copies in CL Brener because the truncations occurred before the regions containing the signature residues.
Part of the heterogeneity found in Family II is due to the fact that some sub-types are missing in each strain/haplotype. Dm28c contains interspersed czp2 and czp4; CL Brener EL, czp2 and czp3; CL Brener NEL, czp3 and czp4; YC6, czp2 and czp3; and TCC, czp3 and czp4 (data not shown, sequences available at TriTrypDB). The sequences of the different sub-types within Family II share 79-100% identity in the catalytic domain (Supplementary Table S5), and the third cluster, present in the YC6 strain, is a hybrid formed mainly by czp1 and a truncated copy of czp2.
In agreement with the differences in the encoded protein domains and the genomic organization, analyses of RNA-seq data showed distinct expression patterns when members of Families I and II are compared. Using the CL Brener RNA-seq data previously described by Belew et al. (2017) and Tavares et al. (2020), which evaluated global RNA levels present in epimastigotes, tissue culture-derived trypomastigotes, and intracellular amastigotes, we compared the transcription profiles of all cruzipain genes throughout the parasite life cycle. As shown in Figure 5, the RNA-seq analysis corroborates previous studies [2,17] showing increased expression of cruzipains belonging to Family I in epimastigotes, whereas most copies belonging to Family II are up-regulated in trypomastigotes. In amastigotes, most cruzipain copies are expressed at lower levels than in epimastigotes and trypomastigotes, with the exception of three copies: czp.2.II.6_esmo and czp.3.II.7_esmo from Family II, and czp1.I.12 from Family I. The Family I gene czp1.I.12 was highly expressed in all stages and is potentially the most highly expressed cruzipain in the CL Brener T. cruzi strain.
3. Differences in the active site of cruzipain sub-types
To investigate the structural differences among cruzipain sub-types, we modeled the catalytic domain of cruzipains from CL Brener, including at least one representative sequence from each sub-type. Due to the complete conservation of the active site within Family I and the existence of cruzain crystal structures corresponding to those sequences, only Family II representatives were modeled.
All modeled sequences showed total coverage and high identity (76% to 81%) to the cruzain template (PDB 1ME3). After energy minimization, validation of the models with QMEAN and ERRAT servers indicated their high quality, with no steric clashes or residues in the forbidden region of the Ramachandran plot (Supplementary Fig. S5, Supplementary Table S6). To generate models compatible with ligand binding, we modeled the proteins in the presence of the hydroxymethyl ketone inhibitor located in the PDB template. Additionally, to better evaluate the impact of residue differences on ligand binding, we determined the frequency of ligand interactions with each cruzain residue, considering all crystal structures available.
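The frequency of ligand interactions per residue can be thought of as the percentage of crystal structures in which that residue contacts the ligand. The sketch below illustrates that calculation only; it is not the nAPOLI workflow, the function name `interaction_frequency` is hypothetical, and the four contact sets are toy placeholders rather than data from any cruzain structure.

```python
from collections import Counter


def interaction_frequency(structures):
    """Percentage of structures in which each residue contacts the ligand.

    `structures` is a list with one collection of residue labels per
    protein-ligand complex.
    """
    if not structures:
        return {}
    # set() ensures a residue is counted at most once per structure.
    counts = Counter(res for contacts in structures for res in set(contacts))
    total = len(structures)
    return {res: round(100.0 * n / total, 1) for res, n in counts.items()}


# Toy placeholder contact sets (NOT real crystallographic data).
toy_structures = [
    {"Gly66", "Leu67", "Asp161"},
    {"Gly66", "Leu67"},
    {"Leu67", "Met145"},
    {"Gly66", "Leu67", "Asp161"},
]

frequencies = interaction_frequency(toy_structures)
for residue, percent in sorted(frequencies.items()):
    print(f"{residue}: {percent}%")
```

Residues that never appear in any contact set are simply absent from the result, which corresponds to a frequency of 0.0%.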
A total of thirteen positions that contain residue variations in the cruzipain sequences were evaluated. Overall, the sub-types’ differences impact the shape, volume, and physical properties of the active site (Figure 6). Even though these differences are distributed throughout the active site, there is an accumulation of amino acid substitutions within the S2 and S1’ subsites among the cruzipain sub-types. These regions are also more frequently involved in interactions with cruzain ligands (Table 2, Supplementary Table S7).
Two modifications occur in the S3 subsite: at residue 61, from Ser (czp1) to Phe (czp2 and czp.3.II.4_EL), and at residue 70, from Asn (czp1) to Lys (czp2 and czp.3.II.4_EL) or Val (czp.3.II.9_NEL and czp4). These changes affect the polarity and volume of S3, creating a better defined and positively charged cavity in czp2 and czp.3.II.4_EL. Among the currently determined structures of cruzain-inhibitor complexes, Ser61 and Asn70 do not interact with any ligands.
The S2 subsite is the best-defined site in cruzain, playing an essential role in enzyme selectivity and ligand recognition [27,30]. This pocket concentrates most of the cruzain residues that interact with ligands in over 50% of the known crystal structures, including Gly66, which hydrogen bonds to 58% of the ligands, and several residues that form a hydrophobic pocket (Leu67, Met68, Ala133, Leu160) and establish hydrophobic interactions with at least 70% of the inhibitors (Supplementary Table S7). Among these highly interacting residues, we found differences in the cruzipain sub-types at positions 67, 68, 138, and 208 (Table 2). Positions 67 and 68 are critical to defining the shape of this pocket. Compared to czp1, which has Leu67 and Met68, czp2 has a bigger and more polar S2 due to Ser67 and Ser68. In contrast, the two czp3 sequences have a smaller and more solvent-exposed S2 than members of all other cruzipain sub-types, since they contain Trp67 and Pro68; in czp.3.II.4_EL, Ala138 is also replaced by a bulkier residue (Thr). The czp4 sub-type has Ser67 and Pro68 (Figure 6E). The most noticeable difference found in czp4 is the replacement of Glu208 by a glycine, making the S2 subsite shallow and open. Glu208, at the bottom of this pocket, is essential for cruzipain 1 to accommodate either hydrophobic or positively charged groups in this subsite [30].
The S1 subsite is similar in all cruzipain sub-types, and the only modification found in this region is the replacement of czp1 Ser64 by Gly64 in all other cruzipains. The modifications at S1’ impact the charge and volume of this subsite. Before the catalytic histidine, czp1 and czp4 have an Asp161, which frequently forms hydrogen bonds with ligands in cruzain crystals and confers a negative character on the S1’ pocket (Table 2). In the other sub-types, this position is occupied by polar neutral residues (Asn in czp2 and Ser in czp3). A conservative substitution is found at position 142, which contains either Ser (czp1, czp4, and czp.3.II.9_NEL) or Thr (czp2 and czp.3.II.4_EL). At the bottom of this pocket, Met145 interacts with 29% of cruzain crystallographic ligands and is replaced by Ile in czp3.
Additionally, we observed differences at position 159, located between the S2 and S1’ sites. Czp1 and czp3 have polar uncharged residues (Gln159 in czp1 and czp.3.II.9_NEL, Thr159 in czp.3.II.4_EL), while czp2 and czp4 have basic residues at this position, Arg159 and Lys159, respectively (Figure 6).
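The substitutions described in this section can be enumerated directly from the thirteen positions listed in Table 2. The following sketch is illustrative only: the rows are transcribed by hand from Table 2, and the names `POSITIONS`, `ROWS`, `substitutions`, and `per_subsite` are hypothetical helpers introduced here.

```python
from collections import Counter

# (subsite, position) for each column of Table 2, in table order.
POSITIONS = [
    ("S3", 61), ("S3", 70),
    ("S2", 67), ("S2", 68), ("S2", 69), ("S2", 138), ("S2", 163), ("S2", 208),
    ("S1", 64),
    ("S1'", 142), ("S1'", 145), ("S1'", 161),
    ("S2/S1'", 159),
]

# One-letter residues per representative sequence, transcribed from Table 2.
ROWS = {
    "czp1":           "SNLMNAGESSMDQ",
    "czp2":           "FKSSLAAEGTMNR",
    "czp.3.II.4_EL":  "FKWPLTAEGTIST",
    "czp.3.II.9_NEL": "SVWPLAAEGSISQ",
    "czp4":           "SVSPLAAGGSMDK",
}


def substitutions(name, reference="czp1"):
    """List substitutions relative to the reference, e.g. 'E208G'."""
    return [
        f"{ref}{pos}{res}"
        for (_, pos), ref, res in zip(POSITIONS, ROWS[reference], ROWS[name])
        if ref != res
    ]


def per_subsite(name, reference="czp1"):
    """Count substituted positions in each subsite relative to the reference."""
    return Counter(
        site
        for (site, _), ref, res in zip(POSITIONS, ROWS[reference], ROWS[name])
        if ref != res
    )


for name in ROWS:
    if name != "czp1":
        print(name, substitutions(name), dict(per_subsite(name)))
```

Listing the changes this way makes it easy to check, for example, that E208G is specific to czp4 among the modeled sequences.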
Table 2. Amino acid substitutions in the active site of cruzipain sub-types and frequency of interactions of each position with cruzain ligands.

| Sub-type/Archetype a (residues by subsite) | S3 | S3 | S2 | S2 | S2 | S2 | S2 | S2 | S1 | S1’ | S1’ | S1’ | S2/S1’ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cruzipain 1 | S61 | N70 | L67 | M68 | N69 | A138 | G163 | E208 | S64 | S142 | M145 | D161 | Q159 |
| Cruzipain 2 | F | K | S | S | L | A | A | E | G | T | M | N | R |
| Cruzipain 3 (czp.3.II.4_EL) | F | K | W | P | L | T | A | E | G | T | I | S | T |
| Cruzipain 3 (czp.3.II.9_NEL) | S | V | W | P | L | A | A | E | G | S | I | S | Q |
| Cruzipain 4 | S | V | S | P | L | A | A | G | G | S | M | D | K |
| Frequency of interaction with crystallographic ligands (%) b | 0.0 | 0.0 | 79.2 | 70.8 | 0.0 | 83.3 | 0.0 | 41.7 | 3.0 | 0.0 | 29.2 | 50.0 | 4.2 |
a We modeled one or two isoforms from each cruzipain sub-type, corresponding to sequences from czp.2.II.3_EL, czp.3.II.4_EL, czp.3.II.9_NEL, and czp.4.II.7_NEL.
b Intermolecular interactions were analyzed with nAPOLI using ligands co-crystallized with cruzain.