Detection of Adhesin-like proteins in M. smithii and M. stadtmanae
ALPs have complex, diverse sequences of variable length. There are currently no comprehensive databases for annotating ALPs in methanogen genomes, and ALPs have not, in general, been formally defined. In this study, we applied an expanded search approach, using annotated ALPs from 13 other methanogens (see Materials and Methods) to query the genomes of M. stadtmanae and M. smithii. We report 49 ALPs in the genome of M. smithii and 42 in that of M. stadtmanae, more than were reported in earlier studies [2, 3, 10]. Some ALPs were previously annotated as hypothetical proteins and were therefore not reported when the genomes were published. The 91 ALPs reported here account for ~10% of the genome in each of M. smithii and M. stadtmanae (Table 1). This fraction is considerably larger than in bacteria, where < 1.5% of the proteome is dedicated to adhesins [13]. The difference is mainly due to the higher number of ALP genes in methanogens than in bacteria [13]. ALP lengths ranged from 128 to 4691 amino acids in M. smithii (average: 1299 amino acids) and from 251 to 3356 amino acids in M. stadtmanae (average: 1462 amino acids). Both the number and the length of ALPs contribute to the large fraction of the genome that codes for ALPs in the two species.
Table 1
Summary of ALPs in M. smithii and M. stadtmanae
S.N. | Name | Genome size (Mbp) | Proteome accession | Number of ALPs | ALP coding genome (bp) | Fraction of ALP coding genome (%) |
1. | M. smithii | 1.853 | NC_009515 | 49 | 67292 | 10.9 |
2. | M. stadtmanae | 1.767 | NC_007681 | 42 | 63039 | 10.7 |
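The percentages in Table 1 can be checked with a short calculation. The sketch below is illustrative only and is not part of the original analysis. It assumes that the values in the "ALP coding genome" column count codons (amino-acid residues) rather than base pairs; this assumption is made here because multiplying by three reproduces the tabulated fractions. The function name is hypothetical.

```python
def alp_genome_fraction(coding_length, genome_size_mbp, bases_per_unit=3):
    """Percentage of the genome that codes for ALPs.

    coding_length   : summed ALP length as listed in Table 1
    genome_size_mbp : genome size in Mbp
    bases_per_unit  : 3 if coding_length counts codons (assumption), 1 if bp
    """
    coding_bp = coding_length * bases_per_unit
    genome_bp = genome_size_mbp * 1_000_000
    return 100.0 * coding_bp / genome_bp


# Values copied from Table 1: (ALP coding length, genome size in Mbp)
table1 = {
    "M. smithii": (67292, 1.853),
    "M. stadtmanae": (63039, 1.767),
}

fractions = {
    name: round(alp_genome_fraction(length, size), 1)
    for name, (length, size) in table1.items()
}
print(fractions)
```

Read literally as base pairs (`bases_per_unit=1`), the same values would give fractions of roughly 3.6%, so the unit of that column matters for the comparison with bacteria.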
Identification and Characterization of Protein Domains in M. smithii and M. stadtmanae ALPs
Functional annotations using reference-based tools available in InterPro and Pfam [14] indicate that a large part of the ALP sequences either lacks recognisable domains or shows only limited similarity to Pfam reference domains. Only 30 M. smithii and 24 M. stadtmanae ALPs could be matched to at least one Pfam domain. We therefore assigned domains to ALPs based on sequence similarity to the nearest proteins in the AlphaFold database (Supplementary Table 1). Archaeal ALPs from the two methanogens were found to contain mainly three domain types: membrane anchoring domain(s) (MAD), Archaeal Big domain (ABD) and Right-handed Beta helical (RBH) domain. Other domains, such as the transglutaminase-like (TG-like) domain and carbohydrate binding domains, were also detectable in some ALPs (Fig. 1). The frequency of occurrence of the different domains in ALPs is shown in Table 2. The detailed features of these domains are presented in the sections below.
Table 2
Number of domains in ALPs of M. smithii and M. stadtmanae
S.N. | Domain annotations (Pfam) | Number of ALPs with this domain* (M. smithii) | Number of ALPs with this domain* (M. stadtmanae) |
1 | Archaeal Big domain (ABD) | 45 | 40 |
2 | Right-handed beta helical domain (RBH) | 28 | 34 |
3 | Transglutaminase-like superfamily | 2 | 3 |
4 | Pseudomurein-binding repeat | 2 | 3 |
5 | Carboxypeptidase regulatory-like domain | 1 | 3 |
6 | Chlamydia polymorphic membrane protein (Chlamydia_PMP) repeat | 1 | 2 |
7 | Papain family cysteine protease | 1 | 1 |
8 | PQQ-like domain repeats | - | 1 |
9 | Peptidase propeptide and YPEB domain repeats | - | 1 |
10 | Putative glycosyl hydrolase domain | 1 | - |
*Functional domain annotations are taken from the Pfam database, except for the RBH and ABD domains. AlphaFold structures were consulted only to confirm the presence of ABD and RBH domains. |
Membrane Anchoring Domain in M. smithii and M. stadtmanae ALPs
Two types of membrane anchoring domain (MAD) were recognised in the ALPs of the two organisms: transmembrane (TM) α-helices and amphipathic helices. TM helices were present in most ALPs of both organisms. In general, TM helices are located at the N-terminus of ALPs; in some cases, e.g. YP_001272624 of M. smithii, a duplication of the TM helix at the N-terminus can be observed. In 11 M. smithii ALPs, MADs are present at both termini, which may indicate further complexity in potential interactions with other microbes. In comparison, M. stadtmanae has a single N-terminal MAD in 39 of its 42 ALPs, and no MAD was identified in the remaining three.
For some ALP sequences (10 in M. smithii, 4 in M. stadtmanae), we noticed N- or C-terminal helices with hydrophobic residues in the AlphaFold structure; however, TMHMM failed to assign them any transmembrane (TM) domain. These were stretches of < 20 hydrophobic amino acids, which may not form a complete TM helix (Supplementary Table 2). The typical length of a TM helix is suggested to be 24.0 (± 5.6) amino acids [15], although it depends on the amino-acid sequence and hydrophobicity [16]. The hydrophobicity index values of the short hydrophobic helices in ALPs, calculated with the HeliQuest sequence analysis module [17], were typically > 1, comparable to those calculated for full-length TM helices. These short helices were rich in long-chain hydrophobic residues such as Leucine, Isoleucine and Phenylalanine [18]. We therefore marked these short helices, together with the TM helical domains, as membrane anchoring domains, as they might also help anchor ALPs to the lipid membrane. Earlier studies have shown that the lipid bilayer can adapt to TM helices as short as 10–12 Leucines [19] and that lipid bilayers can adjust to negative mismatches [16]. It is proposed that the response of short helices to the surrounding lipid bilayer depends on the nature of the lipids the helix is in contact with, its amino-acid composition and the distribution of amino acids along the helix. The lipid bilayer may respond to the hydrophobic mismatch caused by short helices by compression and chain disordering [20]. Further, the packing response of lipids around a single helix could differ from that around a larger protein [21]. Marginally hydrophobic α-helices have also been shown to be useful in membrane protein folding [22].
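The screening idea described above, short uninterrupted hydrophobic stretches with a high mean hydrophobicity, can be illustrated with a minimal sketch. It is not the HeliQuest or TMHMM procedure: the Kyte-Doolittle hydropathy scale is used here as a stand-in, the residue class counted as hydrophobic and the lower length bound of 10 are assumptions, and the test sequence is invented rather than taken from a real ALP.

```python
import re

# Kyte-Doolittle hydropathy values, used here as a stand-in scale (assumption;
# HeliQuest reports hydrophobicity on a different scale).
KD = dict(zip("ARNDCQEGHILKMFPSTWYV",
              [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
               3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]))


def mean_hydropathy(segment):
    """Average Kyte-Doolittle hydropathy of a peptide segment."""
    return sum(KD[aa] for aa in segment) / len(segment)


def short_hydrophobic_runs(sequence, min_len=10, max_len=19):
    """Yield (start, end, segment) for uninterrupted hydrophobic runs that are
    shorter than 20 residues, i.e. too short for a typical full TM helix."""
    for match in re.finditer(r"[AVILMFWC]+", sequence):
        if min_len <= len(match.group()) <= max_len:
            yield match.start(), match.end(), match.group()


toy = "MKK" + "LLIVFLLAILLV" + "DESKR"  # invented sequence, not a real ALP
runs = list(short_hydrophobic_runs(toy))
for start, end, segment in runs:
    print(start, end, segment, round(mean_hydropathy(segment), 2))
```

Candidate stretches found this way would still need to be checked against the predicted structure, as was done with the AlphaFold models in this study.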
In addition to the short membrane anchoring α-helices, we observed amphipathic helices at the termini of some ALP structures, in most cases where a TM helix was missing. Helical wheel diagrams (Supplementary Fig. 1) show that such sequences in ALPs could fold into an amphipathic α-helix with a hydrophobic face and a hydrophobicity index > 50%, as predicted by HeliQuest [17]. It is tempting to speculate that such helices could anchor ALPs to the lipid membrane by lying parallel to the bilayer, with the hydrophilic surface interacting with the charged lipid head groups and the hydrophobic side exposed to the fatty acid chains of the membrane lipids. Almost all ALPs had at least one MAD; the five exceptions were YP_001272625 and YP_001274282 from M. smithii and YP_447499, YP_447699 and YP_447953 from M. stadtmanae. This could be due to partial annotation of these proteins from the genomic sequence, or the MAD may simply not be predictable with the algorithms used.
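Amphipathicity of the kind shown in a helical wheel can be quantified with a hydrophobic moment. The following is a minimal sketch, assuming an ideal α-helix with 100° of rotation per residue and using the Kyte-Doolittle scale as a stand-in for the scale used by HeliQuest, so absolute values are not comparable with HeliQuest output. Both test sequences are constructed for illustration.

```python
import math

# Kyte-Doolittle hydropathy values (stand-in scale, an assumption).
KD = dict(zip("ARNDCQEGHILKMFPSTWYV",
              [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
               3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]))


def hydrophobic_moment(segment, angle_deg=100.0):
    """Mean hydrophobic moment of a segment modelled as an ideal alpha-helix.

    Each residue contributes a vector whose length is its hydropathy and whose
    direction is its angular position on the helical wheel.
    """
    delta = math.radians(angle_deg)
    sin_sum = sum(KD[aa] * math.sin(i * delta) for i, aa in enumerate(segment))
    cos_sum = sum(KD[aa] * math.cos(i * delta) for i, aa in enumerate(segment))
    return math.hypot(sin_sum, cos_sum) / len(segment)


# Constructed 18-mers: Leu on one face and Lys on the other, versus all Leu.
delta = math.radians(100.0)
amphipathic = "".join("L" if math.cos(i * delta) >= 0 else "K" for i in range(18))
uniform = "L" * 18

print(amphipathic, round(hydrophobic_moment(amphipathic), 2))
print(uniform, round(hydrophobic_moment(uniform), 2))
```

A uniformly hydrophobic helix has a moment close to zero because the contributions cancel around the wheel, whereas a helix with segregated hydrophobic and polar faces has a large moment.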
Further, ALPs were found to be rich in N/K/R residues at the N-terminus of the transmembrane domain. The cytoplasmic di-Lysine motif, shown in earlier studies to be involved in trafficking of proteins to the ER and plasma membrane [23, 24], was observed at the N-terminus of TM helices in 22 ALPs. In 36 other ALPs, one of the Lysines was substituted by Asparagine, and others also had Arginine and Aspartic acid (Supplementary Table 3). In ALPs, such motifs might be important for inserting a MAD into the membrane in the required orientation. Interestingly, such motifs were also observed near the C-terminal TM helix, indicating that they may be oriented towards the cytoplasmic side (Supplementary Table 4). The presence of such signals adjacent to TM helices gives a clue to the possible orientation of the helices in the cell membrane. For example, YP_001272624 of M. smithii has two TM helices at the N-terminus; the di-Lysine motif is found only before the second TM helix, indicating its possible orientation from the cytoplasmic to the extracellular side. Notably, such signal sequences were missing in most short helices, suggesting that small helices could help in insertion without spanning the whole membrane bilayer.
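A simple way to screen for such flanking motifs is to inspect a short window immediately upstream of a predicted TM helix. The sketch below is illustrative and not the procedure used in this study: the window size, the function name and the definition of a "variant" (one Lysine paired with Asparagine or Arginine) are assumptions, and the sequences are invented.

```python
import re

DI_LYSINE = re.compile(r"KK")
# Assumed variant definition: one Lys kept, the other replaced by Asn or Arg.
VARIANT = re.compile(r"K[NR]|[NR]K")


def flanking_motif(sequence, tm_start, window=6):
    """Classify the residues just upstream of a TM helix.

    tm_start is the 0-based index of the first TM residue (e.g. taken from a
    TM predictor); window is how many upstream residues to inspect.
    """
    upstream = sequence[max(0, tm_start - window):tm_start]
    if DI_LYSINE.search(upstream):
        return "di-lysine"
    if VARIANT.search(upstream):
        return "variant"
    return None


# Invented examples; the hydrophobic stretch stands in for a TM helix.
print(flanking_motif("MSNKKILLIVFLLAILLVAG", tm_start=5))
print(flanking_motif("MSKNLLIVFLLAILLVAG", tm_start=4))
print(flanking_motif("MSDELLIVFLLAILLVAG", tm_start=4))
```

For a C-terminal TM helix, the same check could be applied to the window immediately downstream of the helix instead.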
Right-handed beta helical (RBH) domain in M. smithii and M. stadtmanae ALPs
RBH is the third most represented domain found in ALPs of the two methanogens (Table 2). M. smithii has 28 ALPs with 43 RBH domains and M. stadtmanae has 34 ALPs with 52 RBH domains. Many ALPs have repeats of the RBH domain, such as YP_001273761 of M. smithii, which has 7 repeats.
The RBH domain was initially identified in Erwinia chrysanthemi, a plant-pathogenic bacterium, as the pectin-binding domain of pectate lyase C [25]. It has subsequently been discovered in many other enzymatic proteins, such as those involved in hydrolysing lectins and other carbohydrates [26]. The beta-helical rod of this parallel β-helical domain provides a larger groove on its surface for recognising long carbohydrate molecules [27, 28]. Recognition is mediated through a chain of conserved Asparagines in the loops; Asparagine has been suggested to be the most common amino acid in RBH domains [29]. The loops mainly contain charged and polar amino acids, which explains their ability to bind long polysaccharides. The InterPro entry IPR039448 indicates that the RBH domain is highly represented in bacteria compared with other groups of organisms, while in Archaea it is mostly found in the families Methanobacteriaceae and Methanosarcinaceae. This could be due to its presence in ALPs, which form a large fraction of the archaeal proteome. Similar structures have also been observed in viruses, where they serve host attachment and infection [30].
The fold is suggested to have diverged from a common ancestor, based on the presence of a conserved α-helix capping the N-terminus of the beta helix. The cap motif also inhibits oligomeric interactions such as those found in amyloid formation [31]. Furthermore, RBH domains in archaeal ALPs can be relatively large. The longest RBH domain (YP_447868 of M. stadtmanae), with a length of 1236 amino acids, folds into a ~100 Å long rod carrying 14 turns. Although rare in eukaryotes, the RBH domain is highly prevalent in surface proteins of bacteria and fungi, many of which are involved in pathogenesis [33].
Archaeal Big domain (ABD)
Archaeal Big domains (ABDs) are the most abundant domain repeats in M. smithii and M. stadtmanae ALPs and are found in almost all the ALP sequences. Figure 2 shows the phylogenetic tree constructed from the alignment of 279 M. smithii ABDs and 222 M. stadtmanae ABDs with 80 bacterial stalk domains [13]. These ABDs seem to have diverged from bacterial stalk domains, as most of the archaeal sequences cluster together in the phylogenetic tree and form groups distinct from the bacterial stalk domain sequences. ABD definitions are not included in the Pfam domain database [14]. Pfam either failed to assign a domain family to a large part of the archaeal ALPs or, in most cases, assigned Big3 (Pfam ID: PF16640) and DUF11 (Pfam ID: PF01345) domains, generally with a high e-value. We also searched for the ABDs in ‘refseq_genomes (1711 databases) at NCBI excluding archaea (NCBI taxid:2157)’ and found no BLAST hit. This indicates that these domains may be unique to archaeal species and are not ubiquitous in other domains of life, suggesting a potentially specific role in archaeal symbiosis.
We notice broad clades of ABDs in M. smithii and M. stadtmanae, as marked in Fig. 2. While most clades contain no sequences of apparent bacterial origin, some ABD clades clustered with at least four bacterial stalk domain sequences in the phylogenetic tree (MucBP (A0A806LF85), LVIVD (A0A0S1YA82) and Trp_ring (F9N556) of the nonESET clan; Big6 (A0A150KJ36), Big3_5 (A0A2V7S5F5), Big3 (R5U8D9) and Big2 (A0A0E1X8Y2) of the ESET clan) (Fig. 2). These could be the precursors from which archaeal ABDs evolved and subsequently diverged to acquire unique features. Some ABDs are longer than the most common ABDs, and only a few ALPs have such domains (YP_001272746, YP_001272984, YP_001274106 and YP_001274107 of M. smithii) (Supplementary Fig. 2).
We analysed the frequency and patterns of amino acids present in ABDs and bacterial stalk domains (Fig. 3). The comparison shows that, although archaeal ABDs have conserved Glycine residues similar to those of bacterial stalk domains, they have also acquired unique, highly conserved features. The uniquely conserved residues of archaeal ABDs are marked in Fig. 3.
An ABD folds into a typical β-sandwich in Greek key topology with seven strands (Fig. 4a). In longer ABDs, the β-sandwich is formed by nine β-strands (Fig. 4b). The conserved Glycine residues present in loops are marked in the representative three-dimensional structure obtained from AlphaFold (Fig. 4a). Notably, the conserved residues occur in loops, which may be important for interaction with other protein domains, while the core is conserved with hydrophobic residues. Further, some strands in ABD folds have conserved long-chain hydrophobic residues such as Val, Ile, Phe and Leu, while others show conservation of smaller residues such as Gly. The representative structure was searched against the Dali database to identify the closest structural homologue of the ABD. Interestingly, the root mean squared deviation (r.m.s.d.) between an ABD of M. smithii (NCBI accession: YP_001272624) and the nearest structure (Big1 domain of bacterial invasin, PDB ID: 1CWV) was only 1.8 Å, although the two shared only 10% sequence similarity. Similarly, the nearest structural homologue of another ABD of the same ALP, belonging to a different clade in the phylogenetic tree (HLA class-I histocompatibility antigen, PDB ID: 1EWO, r.m.s.d.: 1.8 Å), was only 18% similar in sequence. This indicates sequence divergence from ancestral bacterial homologues while the overall three-dimensional fold of seven β-strands was conserved.
ABD is found in repeats on ALPs and may be important for extending their range to reach symbiotic microbes. The ABD repeats of some ALPs are highly similar: for example, YP_447631 of M. stadtmanae has a total of 27 ABDs belonging to two major clades in the phylogenetic tree, and 20 of them share 60.6% average similarity with each other. Other ALPs have divergent ABD sequences belonging to different phylogenetic clades: for example, YP_447476 of M. stadtmanae has three ABDs, each clustering with a different clade, indicating early divergence of ABDs from bacteria.
Other domains
Less common domain types in the ALPs of M. smithii and M. stadtmanae are currently grouped under ‘others’, as they differ from the commonly observed MAD, RBH and ABD domain types. While the overall occurrence of these domain types is low in ALPs of the two methanogens, the domains vary and, in some ALPs, multiple ‘other’ domains can be detected. They include domains that show limited structural similarity to transglutaminase, pseudomurein-binding protein, PQQ-like domain, Lectin-like domain etc. These are listed in Table 2 along with the number of occurrences in ALPs of the two organisms.
Groups within M. smithii and M. stadtmanae ALPs
Functional domain annotations using Pfam indicated that M. smithii and M. stadtmanae ALPs are less diverse than those of bacteria. Only 11 domain families were identified by the sequence-based search in the Pfam database in both species together. In bacteria, by contrast, altogether 109 types of stalk and adhesive domains are present [13, 34]. The most common architecture in the analysed archaeal ALPs was RBH domain repeats at the N-terminus followed by ABD repeats. In the majority of ALPs, a single transmembrane helix at the N-terminus could act as the membrane anchoring domain; in others, it was present at the C-terminus or at both ends. In addition to the above ALPs, there were 14 sequences in total (9 in M. smithii and 5 in M. stadtmanae) that lacked both RBH and ABD domains, although they were picked up in our sequence-based search. These could be partial, as they are short and likely present incomplete domains. These sequences were discarded and not classified as ALPs in this study (Supplementary Table 5(a)(b)).
The alignment of ALPs with different repetitive structures of varying length may lead to misalignments, potentially introducing errors in phylogenetic inference. However, the detailed characterization of individual ALP domains allows ALPs to be grouped based on their specific domain architecture, independently of their sequence similarity (Fig. 5). As an alternative approach, we used density-based clustering of text strings that represent the domain architecture, which allows ALPs to be binned into distinct classes. Based on the clustering, we propose five groups of ALPs (counting subgroups separately) in M. smithii and M. stadtmanae. A growing number of ALP groups might be expected in the future as more archaeal species are analysed for divergent ALPs. The groups proposed here are based on the presence or absence and the positions of ABD, RBH, transglutaminase and other domains, as described further below. In general, we observed that ALPs contain at least one ABD or RBH domain. If a protein sequence had neither of these domains, we did not classify it as an ALP, although such proteins were picked up in our BLAST searches together with other ALPs. There were 9 such sequences in M. smithii and 5 in M. stadtmanae (Supplementary Table 5(a)(b)). All ALPs of M. stadtmanae have a MAD only at the N-terminus (except 3 ALPs with no MAD), while 11 of 49 M. smithii ALPs have MADs on both termini. Further, five ALPs of M. smithii (NCBI accession: YP_001272625, YP_001272839, YP_001273878, YP_001274107 and YP_001274163) were not fully annotated for domains because of low sequence similarity with AlphaFold structures or partially predicted structures. They were therefore tentatively assigned to ALP groups based on current knowledge of their domains. The following three groups (excluding subgroups) and one currently uncategorized set of “others” ALPs are proposed in M. smithii and M. stadtmanae, regardless of the position of the MAD, and are listed in Supplementary Table 6(a)(b).
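The idea of clustering architecture strings by density can be sketched as follows. The clustering algorithm and distance measure used in this study are not specified in this section, so the choice of DBSCAN with Levenshtein (edit) distance is an assumption, as are the one-letter domain codes (M = MAD, R = RBH, A = ABD, T = TG-like), the parameter values and the toy architectures.

```python
def levenshtein(a, b):
    """Edit distance between two architecture strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def dbscan(items, eps, min_pts, dist):
    """Minimal DBSCAN; returns one label per item (-1 marks noise)."""
    n = len(items)
    neighbours = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n          # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1       # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # expand from core points only
    return labels


# Toy architectures written N- to C-terminus (hypothetical one-letter codes).
architectures = ["MRAAAA", "MRAAA", "MRAAAAA", "MAAAT", "MAAT", "MAAAAT", "MRRRM"]
labels = dbscan(architectures, eps=1, min_pts=2, dist=levenshtein)
print(dict(zip(architectures, labels)))
```

Because edit distance penalises differences in repeat number, architectures that differ only in the count of ABD repeats may need repeats collapsed (for example "MRA+") before clustering; whether to do so is a design choice that depends on the grouping intended.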
Group-I
This is the largest ALP group in both M. smithii and M. stadtmanae (n = 52), and most proteins in this group consist of only three domain types, i.e. MAD, RBH domain and ABD. No ‘Other’ domain is present in this ALP group. The MAD is N-terminal in 48 ALPs, the only modifications being a duplication in one ALP (YP_001272624 of M. smithii) and an additional C-terminal MAD in two ALPs (YP_001272984 and YP_001273972 of M. smithii). Visual inspection indicates that this group can be divided further into two subgroups.
Subgroup-IA
In most cases (n = 40), the RBH domain lies adjacent to the MAD and is followed by a varying number of ABD repeats. In general, the RBH domain is present as a single domain in the majority of ALPs (10 of 16 M. smithii ALPs and 18 of 24 M. stadtmanae ALPs). Several variations of this pattern are observed in this subgroup, e.g. multiple RBH domains can be present. A notable example is YP_001273761, which has 7 RBH domain repeats at the N-terminus followed by 18 ABD repeats. This is also the largest ALP found in M. smithii, with a length of 4691 amino acids. Compared to other subgroups, the ALPs of this subgroup are larger.
Subgroup-IB
Similar to subgroup-IA, ALPs of this subgroup contain only ABD and RBH domains in addition to the MAD; however, the relative positions of the ABD and RBH domains are not fixed. In some ALPs, ABD repeats are found N-terminal to the RBH domain, e.g. YP_448130. Except for YP_001273972, all ALPs of this subgroup have a single MAD at the N-terminus. YP_001274127 has an amphipathic helix at the N-terminus, which could act as a membrane anchor as there is no TM domain. This ALP is also unique in having a small RBH domain at the C-terminus, with ABDs N-terminal to the RBH domain. M. smithii has 7 and M. stadtmanae has 5 ALPs belonging to this subgroup. No membrane anchoring domain was located in YP_447953 of M. stadtmanae.
Group-II
ALPs of this group contain only one domain type, either ABD or RBH, in addition to a MAD at either or both termini. Although such ALPs are present in our dataset (15 in M. smithii and 5 in M. stadtmanae), they may be proteins with unidentified N- or C-terminal domains, owing to low sequence homology of these domains with AlphaFold structures. This is supported by the fact that 4 M. smithii ALPs of this group are only partially annotated for domains. It is also possible that the protein sequences of these ALPs are partial. For example, YP_001274262 has only one ABD in addition to two amphipathic helices, possibly acting as MADs, at the N-terminus, and it is only 203 amino acids long. Similarly, YP_001274311 of M. smithii is 156 amino acids long and has one amphipathic helix at the N-terminus and only one ABD.
Subgroup-IIA
ALPs of this subgroup (n = 13) are characterized by having only ABDs, while RBH and ‘Other’ domains are completely absent. The MAD is N-terminal in 8 ALPs, C-terminal in one ALP and present on both termini in one ALP, while in the other three ALPs no MAD could be identified. The ABD is present in a varying number of repeats (1–17).
Subgroup-IIB
ALPs of this subgroup have only the RBH domain and no other domain in addition to the MAD. Five ALPs belong to this subgroup (4 from M. smithii and 1 from M. stadtmanae). All M. smithii ALPs of this subgroup have MADs on both termini, while the M. stadtmanae ALP has one only at the N-terminus. Repeats of the RBH domain are present, which could mediate interactions by extending the length of the ALPs. It is interesting to note that RBH is not present as repeats in other groups.
Group-III (Transglutaminase (TG)-type)
ALPs in this group have a transglutaminase domain at the C-terminus and ABDs at the N-terminus. Pfam also identified a pseudomurein-binding repeat domain beside the transglutaminase domain. Compared to other groups, this group has shorter ALPs. These ALPs have a single N-terminal transmembrane helix serving as the membrane anchor. Further, there are no RBH domains in this group of ALPs, which may suggest that, in ALPs of other groups, the RBH domain acts as an adhesive domain in addition to extending ALPs to reach the surface of other microorganisms. Currently, there are only four ALPs in this group from the two organisms.
Others
There are 7 M. smithii and 6 M. stadtmanae ALPs in our dataset that contain domains in addition to the RBH, ABD or transglutaminase domains (Table 2). These unique domains could be involved in specific substrate binding. YP_001273741 has MADs on both termini. The seven M. smithii domains had an alpha-beta fold in the closest AlphaFold structures. The six M. stadtmanae ALPs were as annotated by Pfam. Further, the four M. smithii ALPs (YP_001272746, YP_001272984, YP_001274106 and YP_001274107) with the longer ABD (with 9 β-strands) are marked with an asterisk in the phylogenetic tree (Fig. 5). YP_001272746 and YP_001272984 have an ABD with a beta strand extending from a loop of the RBH domain. These two ALPs have MADs on both sides. It is interesting to note that most ALPs in other groups with MADs on both sides have single domains rather than repeats of ABD or RBH domains. Since these ABDs are structurally distinct from other ABDs, this might indicate functional divergence from other ALPs.
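The group definitions given above can be summarised as a small rule-based sketch. It is a simplification for illustration, not the procedure used to build Supplementary Table 6: the domain labels ("MAD", "RBH", "ABD", "TG", and any other string for an ‘Other’ domain) and the function name are hypothetical, the position of the TG domain is not checked, and the tentative assignments of partially annotated ALPs are not modelled.

```python
def classify_alp(architecture):
    """Assign an ALP group from its domain architecture.

    architecture: list of domain labels ordered from N- to C-terminus,
    e.g. ["MAD", "RBH", "ABD", "ABD"]. MAD position is ignored, as in the text.
    """
    domains = [d for d in architecture if d != "MAD"]
    kinds = set(domains)

    if not kinds & {"ABD", "RBH"}:
        return None                      # neither ABD nor RBH: not an ALP
    if kinds == {"ABD", "TG"}:
        return "Group-III"               # ABDs plus transglutaminase, no RBH
    if kinds - {"ABD", "RBH"}:
        return "Others"                  # any additional domain type
    if kinds == {"ABD"}:
        return "Subgroup-IIA"
    if kinds == {"RBH"}:
        return "Subgroup-IIB"

    # Both RBH and ABD present: IA if every RBH precedes the first ABD.
    last_rbh = max(i for i, d in enumerate(domains) if d == "RBH")
    first_abd = domains.index("ABD")
    return "Subgroup-IA" if last_rbh < first_abd else "Subgroup-IB"


examples = {
    "RBH then ABD repeats": ["MAD", "RBH", "ABD", "ABD"],
    "ABD before RBH": ["MAD", "ABD", "RBH"],
    "ABD only": ["MAD", "ABD"],
    "RBH only, MAD on both termini": ["MAD", "RBH", "RBH", "MAD"],
    "TG-type": ["MAD", "ABD", "ABD", "TG"],
    "extra domain": ["MAD", "ABD", "PQQ"],
}
for name, arch in examples.items():
    print(name, "->", classify_alp(arch))
```

Writing the rules out this way makes edge cases explicit, for instance an ALP carrying both a TG-like and an RBH domain, which falls into "Others" here because the text does not describe such a combination.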