The systematic workflow for collecting, expanding, and mining QS entries. To construct a comprehensive QS repository for human gut microbiota, we developed a systematic workflow comprising three modules (QS collecting, QS expanding, and QS mining) and four ensemble classifiers based on ML algorithms (Fig. 1). In the QS collecting module, we first obtained 213 recognized QS entries (Dataset I) from the SigMol and Quorumpeps databases and curated their corresponding amino acid sequences from the UniProt database. In parallel, we manually searched the 818 gut microbes from the VMH database27 (Dataset II) to collect reported QS entries, which are termed “positive samples” (Dataset III). The search was based on four commonly used QS annotations, i.e., “quorum sensing”, “LuxR”, “two-component”, and “tryptophanase”. The negative samples (Dataset IV) were then obtained from typical proteomes in Dataset II, such as those of Escherichia coli and Pseudomonas aeruginosa, by removing the proteins that conform to QS cluster rules (more details in Method section). These rules were developed from Dataset I through sequence analysis, including evolution analysis, QS-relevant protein annotations, and comparison of amino acid sequence descriptors (more details in Method section). In the QS expanding module, we obtained an extended protein dataset (Dataset V) by running local BLASTP36 on Datasets I and II with an E value37 threshold of less than 10⁻⁵, which is commonly used in sequence alignment to identify homologs. Four different ML algorithms (SVM, RF, KNN, and DNN) were used to construct ensemble classifiers, which were trained and validated on the above positive (Dataset III) and negative (Dataset IV) samples. After excluding from Dataset V the entries already collected as reported QS entries, the remaining entries (Dataset VII) were classified by the four ML-based ensemble classifiers stated above. The output of these classifiers was further processed in the QS mining module, where the potential QS entries predicted by the classifiers, which had not previously been discovered and annotated, were mined and sorted manually with the help of functional analysis and homology modelling supported by the UniProt38, NCBI (https://www.ncbi.nlm.nih.gov/), and Phyre2 databases39.
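The expanding step described above can be illustrated with a minimal sketch. It assumes that the BLASTP results are available in the standard 12-column tabular format (query ID, subject ID, ..., E value, bit score) and that the already-reported entries are given as a set of identifiers; the function name and the example identifiers are hypothetical and are not taken from the actual pipeline.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def filter_homologs(blast_tsv, known_entries, cutoff=E_VALUE_CUTOFF):
    """Return subject IDs whose E value is below the cutoff and that are
    not already present in the set of reported entries."""
    hits = set()
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        subject_id = row[1]        # column 2: subject sequence ID
        e_value = float(row[10])   # column 11: E value
        if e_value < cutoff and subject_id not in known_entries:
            hits.add(subject_id)
    return hits


# Hypothetical BLASTP hits (IDs are placeholders, not real accessions).
example_hits = "\n".join([
    "Q1\tS1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t300",
    "Q1\tS2\t30.0\t80\t56\t2\t1\t80\t5\t84\t0.002\t35",
    "Q2\tS3\t85.0\t150\t22\t0\t1\t150\t1\t150\t1e-30\t210",
])

# S2 fails the E value threshold; S3 is already a reported entry.
candidates = filter_homologs(example_hits, known_entries={"S3"})
```

Only the entries that pass both filters would then be passed on to the classifiers.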
Reported and annotated QS entries. There are 84 autoinducer synthases and 129 QS receptors in Dataset I. We divided the autoinducer synthases into seven types, i.e., AHLs, DSFs, AI-2, indole, HAQs, CAI-1, and others. AHLs synthases account for the vast majority and can be roughly divided into two protein families, LuxI (from Vibrio fischeri) and YenI (from Yersinia enterocolitica) (Fig. 2A). We also divided the QS receptors into seven types, i.e., LuxR type, TCS type, CAI-1 receptor, AI-2 receptor, DSFs receptor, HAQs receptor, and other receptors (Fig. 2B). LuxR and TCS type receptors account for the vast majority of QS receptors. Similarly, LuxR type receptors can be roughly divided into two protein families, LuxR (from V. fischeri) and YenR (from Y. enterocolitica). Note that the evolutionary trees of AHLs synthases and of their receptor counterparts are highly similar (Fig. 2A and 2B), part of which was also identified by Gray et al.40. This indicates that AHLs synthases and their corresponding receptors have coevolved.
There are 1,640, 5,921, 15,703, and 66 QS entries for “quorum sensing”, “LuxR”, “two-component”, and “tryptophanase”, respectively (Fig. 2C). LuxR-type and TCS QS entries account for the vast majority, at 25.38% and 67.31%, respectively. We also show the distribution of QS entries for each strain of the seven-strain simplified human microbiome (SIHUMI) used by Colosimo et al.41 (Fig. 2D), where LuxR and TCS type QS entries again account for the vast majority. Furthermore, we noted certain overlaps among the four types of QS entries. For example, seven QS entries (P69409, P0ACZ6, P0AGA8, P66798, P0AF30, P0AEL9, and Q8XE66) are shared by LuxR and TCS receptors in the E. coli O157:H7 strain (Fig. 2E). This suggests potential crosstalk between LuxR type and TCS QS systems. In addition, we counted the total QS entries for each of the 818 gut microbes from the VMH database27 and plotted their distribution to form a better picture of the QS repository in human gut microbiota (Fig. 2F). We found that about 90% of strains contain fewer than 60 QS entries, and only seven strains, listed in Fig. 2F, have more than 150 QS entries. This distribution will be revisited after extended QS entries are included (see below).
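The keyword-based grouping and the overlap between groups can be sketched as follows. This is only an illustration under simplifying assumptions: each record is a pair of an entry identifier and a free-text annotation, matching is a case-insensitive substring test, and all identifiers and annotation strings are hypothetical.

```python
# The four QS annotations used for the search, as named in the text.
QS_KEYWORDS = ("quorum sensing", "LuxR", "two-component", "tryptophanase")


def match_keywords(annotation):
    """Return the set of QS keywords found in one annotation string."""
    text = annotation.lower()
    return {kw for kw in QS_KEYWORDS if kw.lower() in text}


def group_by_keyword(records):
    """Map each keyword to the set of entry IDs whose annotation matches it."""
    groups = {kw: set() for kw in QS_KEYWORDS}
    for entry, annotation in records:
        for kw in match_keywords(annotation):
            groups[kw].add(entry)
    return groups


# Hypothetical records (placeholder IDs and annotation strings).
records = [
    ("HYP001", "Two-component system response regulator, LuxR family"),
    ("HYP002", "Tryptophanase"),
    ("HYP003", "Sensor histidine kinase of a two-component system"),
    ("HYP004", "DNA polymerase III subunit alpha"),
]

groups = group_by_keyword(records)
# Entries annotated as both LuxR type and TCS, analogous to the overlap
# reported for E. coli O157:H7.
shared = groups["LuxR"] & groups["two-component"]
```

An entry can fall into more than one group, which is why the four per-annotation counts need not sum to the number of distinct entries.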
Expanded and new QS entries. We conducted 5-fold cross validation to test the classifiers, using accuracy, precision, recall, and F1 score (more details in Method section) to evaluate their performance. The RF classifier achieves the highest accuracy and F1 score among the four classifiers, which indicates that it performs best, followed by KNN, SVM, and DNN. To obtain more details on the positive entries predicted by the different classifiers, we manually checked their annotations and categorized the proteins into four types, i.e., QS irrelevant, autoinducer synthases, QS receptors, and uncharacterized proteins. The results show that QS receptors account for the vast majority, followed by the autoinducer synthases (Fig. 3B).
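For reference, the four evaluation metrics and a 5-fold split can be written out as a short sketch. It assumes binary labels (1 for QS, 0 for non-QS) and uses contiguous, unshuffled folds for simplicity; the function names are illustrative and this is not the evaluation code used in this work.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Toy labels: 3 true positives, 3 true negatives, 1 false positive,
# 1 false negative.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                 [1, 1, 0, 0, 0, 1, 1, 0])
folds = list(k_fold_indices(10, k=5))
```

In each round, a classifier would be trained on the `train` indices and scored on the `test` indices, and the metrics averaged over the five folds.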
In addition to collecting the confirmed autoinducer synthases and QS receptors, we further analysed the uncharacterized proteins (534 entries) among the positives predicted by the three better-performing classifiers (RF, SVM, KNN), in order to mine more QS-relevant proteins. We re-annotated the 534 entries and manually grouped them into nine protein clusters (Fig. 3C), of which histidine kinase (a major component of a TCS) is the largest. Note that another 28 entries are only vaguely described, without specific protein annotations (Fig. 3C). As listed in Table 1, these entries were further explored and re-annotated based on the web BLASTP of the NCBI database or on Phyre2. Twenty proteins (Table 1, upper) could be re-annotated based on the BLASTP results from NCBI. Except for U2J6M1 and C0C5Y6, the other 18 proteins are potential QS proteins. ArsR, a component of the ArsRS TCS, regulates the acid adaptation and biofilm formation of the pathogen Helicobacter pylori in the human gut42. Beta-ketoacyl-ACP synthase III catalyzes the condensation reaction of fatty acid synthesis, which indicates that Prevotella bivia may produce dialkylresorcinols, as DarB from Photorhabdus asymbiotica does43. The histidine kinase, LuxR family regulator, and Rgg/GadR/MutR family regulator are important parts of TCS, LuxR-type, and Rgg-based QS systems44, respectively.
There are eight entries (Table 1, lower) that have no specific annotations or classifications in the NCBI or UniProt databases. We submitted these protein sequences to Phyre2 to investigate the 2D and 3D structures of their models, their domain compositions, and model quality. A0A4Y4IIW5 and A0A5C4P2T9 are modelled as a signalling protein and an AgrC (belonging to the Agr QS system45) family protein, respectively. This indicates that Lysinibacillus fusiformis and Streptococcus salivarius may have some protein components of the Agr QS system, and may thus produce and/or respond to the same QS signalling peptide as the common pathogen Staphylococcus aureus. The other six are templated on the AimR transcriptional regulator, which is the intracellular signal peptide receptor for the QS-based communication between viruses that guides lysis–lysogeny decisions46. This suggests that different Bacillus phages may “listen in” on diverse bacterial hosts, such as Bacillus amyloliquefaciens, Bacillus mycoides, Bacillus thuringiensis, and Bacillus atrophaeus, to coordinate lysis–lysogeny decisions.
Table 1. Results of 28 expanded entries without existing annotations.
| Strains | TaxID | Entry | Template | Query Cover | Percent identity | New annotations | Sources |
|---|---|---|---|---|---|---|---|
| Halococcus morrhuae | 931277 | M0MA34 | WP_004054989.1 | 100% | 100% | ArsR subfamily of regulator | Web BLASTP |
| Clostridium hylemonae | 553973 | C0C300 | WP_006443816.1 | 100% | 100% | Autoinducer 2 ABC transporter | Web BLASTP |
| Prevotella bivia | 868129 | I4Z9V6 | WP_036847997.1 | 80% | 80.39% | Beta-ketoacyl-ACP synthase III | Web BLASTP |
| Enterococcus caccae | 1158612 | R3TYZ5 | WP_069646785.1 | 100% | 80.80% | Histidine kinase | Web BLASTP |
| Lactobacillus ruminis | 525362 | E7FSN7 | WP_003695050.1 | 98% | 98.96% | Histidine kinase | Web BLASTP |
| Streptococcus peroris | 888746 | E8KCS5 | WP_070888551.1 | 100% | 99.58% | Histidine kinase | Web BLASTP |
| Streptococcus parauberis | 1348 | A0A3E1JFV3 | WP_116486843.1 | 100% | 100% | Histidine kinase | Web BLASTP |
| Hungatella hathewayi | 566550 | D3ADP6 | PXX46370.1 | 98% | 92.45% | LuxR family regulator | Web BLASTP |
| Enterococcus cecorum | 1121864 | S1R0J3 | WP_047242627.1 | 100% | 97.31% | Rgg/GadR/MutR family regulator | Web BLASTP |
| Enterococcus cecorum | 1121864 | S1R7E8 | WP_171336239.1 | 98% | 93.70% | Rgg/GadR/MutR family regulator | Web BLASTP |
| Streptococcus constellatus | 1035184 | U2ZME3 | WP_022525523.1 | 100% | 100% | Rgg/GadR/MutR family regulator | Web BLASTP |
| Streptococcus equinus | 525379 | E8JR85 | WP_029875994.1 | 97% | 97.20% | Rgg/GadR/MutR family regulator | Web BLASTP |
| Streptococcus intermedius | 1095731 | U2XPZ3 | WP_003032153.1 | 100% | 100% | Rgg/GadR/MutR family regulator | Web BLASTP |
| Candidatus Melainabacteria | 2052166 | A0A3S0FWU1 | MBI4533416.1 | 80% | 47.68% | Sensor histidine kinase | Web BLASTP |
| Candidatus Melainabacteria | 2052166 | A0A431KQ57 | MBI5174129.1 | 79% | 47.28% | Sensor histidine kinase | Web BLASTP |
| Coriobacteriales bacterium | 2491116 | A0A437UTJ5 | WP_130811315.1 | 99% | 43.81% | Sensor histidine kinase | Web BLASTP |
| Lactobacillus amylolyticus | 585524 | D4YTV9 | EST03116.1 | 97% | 36.63% | Sensor histidine kinase | Web BLASTP |
| Alistipes putredinis | 445970 | B0MUZ2 | OKY96599.1 | 100% | 96% | Tryptophanase | Web BLASTP |
| Sphingobacterium paucimobilis | 1346330 | U2J6M1 | WP_021069213.1 | 100% | 100% | DoxX family, membrane protein YphA | Web BLASTP |
| Clostridium hylemonae | 553973 | C0C5Y6 | WP_006444869.1 | 100% | 100% | Sugar ABC transporter protein | Web BLASTP |

| Strains | TaxID | Entry | Template | Confidence | Coverage | New annotations | Sources |
|---|---|---|---|---|---|---|---|
| Bacillus amyloliquefaciens | 1390 | A0A5C8IUS9 | c5xybB | 100% | 97% | AimR transcriptional regulator | Phyre2 |
| Bacillus mycoides | 1405 | A0A1W6AJT8 | c5zvvA | 100% | 90% | AimR transcriptional regulator | Phyre2 |
| Bacillus thuringiensis | 56955 | A0A243M9P9 | c5zw5A | 100% | 95% | AimR transcriptional regulator | Phyre2 |
| Bacillus amyloliquefaciens | 1390 | A0A5C8IY56 | c5zvvA | 100% | 99% | AimR transcriptional regulator | Phyre2 |
| Bacillus atrophaeus | 720555 | A0A0H3E1W6 | c5zvvA | 99.90% | 98% | AimR transcriptional regulator | Phyre2 |
| Bacillus atrophaeus | 720555 | A0A0H3E2G4 | c5zw5A | 100% | 100% | AimR transcriptional regulator | Phyre2 |
| Lysinibacillus fusiformis | 28031 | A0A4Y4IIW5 | c6mfvC | 100% | 90% | Signaling protein (tetratricopeptide repeat) | Phyre2 |
| Streptococcus salivarius | 1304 | A0A5C4P2T9 | c4bxiA | 99.90% | 33% | ATP binding domain of AgrC | Phyre2 |
To sum up, with the help of the proposed systematic workflow (Fig. 1), we obtained a comprehensive QS repository for 818 gut microbes that includes 21,410 manually collected positive samples and 7,157 extended ones; the 28,567 QS entries in total are composed of 1,882 QS synthases and 26,685 receptors. The comprehensive QS repository (Fig. 3D) contains 33.43% more QS entries than the previous annotation-based QS collection (Fig. 2F). Furthermore, among the extended entries, we re-annotated 534 proteins and mined eight new potential QS proteins with the help of functional analysis and homology modelling. These results provide a basis for further exploration of the related QS mechanisms and their applications.
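The summary figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this section; the variable names are illustrative.

```python
# Counts quoted in the text.
positive_samples = 21_410   # manually collected QS entries
extended_entries = 7_157    # entries added by the expanding module
qs_synthases = 1_882
qs_receptors = 26_685

# Total size of the repository, computed two ways.
total_by_source = positive_samples + extended_entries
total_by_type = qs_synthases + qs_receptors

# Relative increase over the annotation-based collection, in percent.
increase_pct = round(100 * extended_entries / positive_samples, 2)
```

Both totals come to 28,567, and the relative increase rounds to 33.43%, in agreement with the values reported above.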
QSHGM browsing and searching. To enable user-friendly browsing and searching of the QS entries identified in this work, we constructed a comprehensive QS database of human gut microbiota (QSHGM), which is freely available at: http://www.qshgm.lbci.net/. The “browse” option allows users to explore the QS data, including the annotated and extended QS entries. In the “browse” option, a query box is provided in which the user can browse QS entries by “All”, “Synthases”, or “Receptors”. By “Synthases”, one can query QS entries according to nine QS languages: AHLs, CAI-1, Dialkylresorcinols, Photopyrones, DSFs, HAQs, AIPs, Indole, and AI-2. As an example, part of the browsing results for the AHLs language is illustrated in Fig. 4; the output displays information on the QS entries in the following fields: Entry, Genus, Species, Strain, Taxonomic identifier (TaxID), Protein annotations, conventional abbreviations of QS signals (Languages), and Link Address.
QSHGM also includes a “Search” facility for the different QS entries. In the search option, a query box is provided in which the user can search QS entries by “Microbes”, “Synthases”, or “Receptors”. By “Microbes”, one can query QS entries according to different options: Entry (e.g., J7JCP9), Name (e.g., Pseudomonas aeruginosa), or TaxID (e.g., 208964). The output displays information on the QS entries in the following fields: Entry, Organism from VMH, TaxID from UniProt (Uniprot), Proteome ID, substitute organism (Organism), substitute organism TaxID (Organism ID), all protein counts (All Proteins), QS entry counts for the strain (Counts), and Protein annotations. By “Synthases”, one can search by Entry (e.g., J7JCP9) or Languages (e.g., AHLs). By “Receptors”, one can search by Entry (e.g., P25084) or Annotations (e.g., Histidine kinase). The output displays information on QS synthases and QS receptors in the following fields: Entry, Genus, Species, Strain, TaxID, Protein annotations, and Languages. Note that the search type allows users to retrieve either an exact match or any match containing the query.
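The two search types (exact match and match containing the query) can be illustrated with a small in-memory sketch. This is not the QSHGM implementation: the record type, field names, and example records are hypothetical, and case-insensitive comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QSRecord:
    # Hypothetical subset of the fields shown in the search output.
    entry: str
    organism: str
    tax_id: str
    annotation: str


def search(records, field, query, exact=True):
    """Return records whose `field` equals the query (exact=True) or
    contains it (exact=False), ignoring case."""
    q = query.strip().lower()
    results = []
    for record in records:
        value = getattr(record, field).lower()
        matched = (value == q) if exact else (q in value)
        if matched:
            results.append(record)
    return results


# Placeholder records, not real database content.
records = [
    QSRecord("HYP_R1", "Example organism A", "1001", "Histidine kinase"),
    QSRecord("HYP_R2", "Example organism B", "1002", "Sensor histidine kinase"),
    QSRecord("HYP_R3", "Example organism B", "1002", "LuxR family regulator"),
]

exact_hits = search(records, "annotation", "Histidine kinase", exact=True)
partial_hits = search(records, "annotation", "histidine kinase", exact=False)
```

Here the exact search returns only the first record, whereas the containing search also returns the sensor histidine kinase.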
QS-based microbial interactions prediction. QS-based microbial interactions play an essential role in deciphering complex interactions of natural microbial systems and in dynamically manipulating diverse synthetic microbial consortia. Using the data collected in the QSHGM database, we can predict various potential pairwise QS-based microbial interactions. For example, we predicted AI-2-based communication between E. coli O157:H7 and Bacteroides pectinophilus ATCC 43243 (Fig. 5A), which is in line with the previously reported observation that AI-2 produced by E. coli can influence the Bacteroidetes47. Furthermore, TnaA (the tryptophanase that produces indole) was previously reported in E. coli48 and Enterobacteriaceae49, which is also indicated by the QSHGM database, suggesting a potential indole-based interaction between these two microbes. Therefore, we predicted that a microbial consortium including E. coli O157:H7, B. pectinophilus ATCC 43243, and E. bacterium 9_2_54FAA can be regulated by manipulating the concentrations of AI-2 and indole (Fig. 5B). We can also predict more sophisticated interaction networks. When P. aeruginosa PAO1 is introduced into the above three-strain consortium, complex microbial cell-cell communication based on AI-2, AHLs, and indole is predicted (Fig. 5C), in which the interactions between P. aeruginosa PAO1 and E. coli were reported and validated previously50, 51. When Burkholderia cepacia GG4 is added to the above four-strain consortium, we can also predict the complex QS-based interaction network of a five-strain consortium that communicates with AI-2, indole, AHLs, HAQs, and DSFs (Fig. 5D), which includes a previously validated HAQs-based interaction between P. aeruginosa and B. cepacia GG452. To sum up, the QS-based interaction predictions stated above have been partially verified by experiments in previously reported studies. Therefore, the QSHGM database has great potential for predicting more complex QS-based interaction networks among multiple strains based on diverse QS languages.
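One simple way to express such pairwise predictions is to pair every strain that carries a synthase for a language with every other strain that carries a matching receptor. The sketch below illustrates this rule under that assumption; the strain names and language assignments are placeholders, not entries from the database, and the rule itself is a simplification rather than the exact procedure used here.

```python
from itertools import permutations


def predict_interactions(synthases, receptors):
    """Return (sender, receiver, language) triples for every ordered pair
    of strains in which the sender can produce a signal that the receiver
    can sense."""
    strains = sorted(synthases.keys() | receptors.keys())
    edges = []
    for sender, receiver in permutations(strains, 2):
        shared = synthases.get(sender, set()) & receptors.get(receiver, set())
        for language in sorted(shared):
            edges.append((sender, receiver, language))
    return edges


# Placeholder consortium: which languages each strain produces and senses.
synthases = {
    "strain_A": {"AI-2", "indole"},
    "strain_B": {"AI-2"},
    "strain_C": {"AHLs"},
}
receptors = {
    "strain_A": {"AI-2", "AHLs"},
    "strain_B": {"AI-2"},
    "strain_C": {"indole"},
}

edges = predict_interactions(synthases, receptors)
```

Adding a strain to the two dictionaries extends the predicted network without changing the rule, which mirrors how the three-, four-, and five-strain consortia are built up in Fig. 5.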
QS communication network construction for the human gut microbiota. Microbes communicate via various QS signals, and it is possible to construct a cell-cell communication network among different gut microbes based on diverse QS languages, which we term the “QS communication network”. With the help of the comprehensive QS repository in the QSHGM database, we constructed a QS communication network for the 818 gut microbes based on the “speaking” of the above nine QS languages (Fig. 6A). This intricate network visualizes the complex QS-based communications and interactions among human gut microbiota. Different microbes are linked together through various languages to form a microbial communication network, and the connections could be used to regulate the interactions between a microbe and its neighbours. As shown in Fig. 6A, most of the strains produce the signal AI-2 (567, 69.3% of 818 gut microbes) as the communication language, followed by HAQs (332, 40.6%), DSFs (325, 39.7%), CAI-1 (259, 31.7%), Dialkylresorcinols (129, 15.8%), Photopyrones (107, 13.1%), indole (77, 9.4%), AHLs (64, 7.7%), and AIPs (22, 2.7%).
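A network of this kind can be represented as a mapping from each language to its speakers, with two strains linked whenever they share at least one language. The following sketch shows one possible construction on placeholder data; the strain names are hypothetical and the layout used for Fig. 6A is not reproduced.

```python
from collections import defaultdict
from itertools import combinations


def build_network(speaks):
    """Given strain -> set of spoken languages, return
    (language -> set of speakers, strain pair -> set of shared languages)."""
    speakers = defaultdict(set)
    for strain, languages in speaks.items():
        for language in languages:
            speakers[language].add(strain)
    edges = defaultdict(set)
    for language, strains in speakers.items():
        for a, b in combinations(sorted(strains), 2):
            edges[(a, b)].add(language)
    return dict(speakers), dict(edges)


def speaker_share(speakers, n_strains):
    """Percentage of strains speaking each language, to one decimal place."""
    return {lang: round(100 * len(s) / n_strains, 1)
            for lang, s in speakers.items()}


# Placeholder data: four strains and the languages they speak.
speaks = {
    "strain_A": {"AI-2", "indole"},
    "strain_B": {"AI-2"},
    "strain_C": {"AI-2", "indole"},
    "strain_D": {"AHLs"},
}

speakers, edges = build_network(speaks)
share = speaker_share(speakers, n_strains=len(speaks))
```

The per-language share computed by `speaker_share` corresponds to the kind of percentage quoted for each language above.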
Note that multiple microbes can speak one common language, which is in line with interspecies crosstalk53. Taking six typical languages (AHLs, CAI-1, HAQs, DSFs, Indole, and AI-2) as an example, we found that 64, 40, 22, and 5 species share two, three, four, and five QS languages, respectively (Fig. 6B). AI-2 also ranks first among the languages in genus-level counts (138 genera), which is in line with what has been broadly observed13. Many of the overlaps between languages spoken by different microbes include AI-2 or indole across various genera, which also indicates that both are widely recognizable languages that play a major role in interspecies communication54, 55. We found that languages traditionally considered intraspecies (AHLs, CAI-1, HAQs, and DSFs) may also be involved in some interspecies communication. In addition, the crosstalk of different QS languages across various microbes implies a redundancy of microbial languages that is potentially helpful for the stability of natural microbial systems.
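The tally of species by the number of languages they speak can be sketched as a simple histogram. The sketch assumes a mapping from species to spoken languages and restricts the count to the six typical languages named above; the species names and assignments are placeholders.

```python
from collections import Counter

# The six typical languages named in the text.
TYPICAL_LANGUAGES = frozenset({"AHLs", "CAI-1", "HAQs", "DSFs", "Indole", "AI-2"})


def multilingual_histogram(speaks, languages=TYPICAL_LANGUAGES):
    """Count species by how many of the given languages they speak,
    keeping only species that speak at least two."""
    hist = Counter()
    for spoken in speaks.values():
        n = len(spoken & languages)
        if n >= 2:
            hist[n] += 1
    return dict(hist)


# Placeholder species; "AIPs" is ignored because it is not one of the six.
speaks = {
    "species_1": {"AI-2", "Indole"},
    "species_2": {"AI-2", "Indole", "DSFs"},
    "species_3": {"AI-2"},
    "species_4": {"AHLs", "HAQs", "AIPs"},
}

histogram = multilingual_histogram(speaks)
```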
The QS communication network was constructed based on the 818 human gut microbes, which include mainly Firmicutes (79), Actinobacteria (36), Proteobacteria (69), Bacteroidetes (16) and others (10). To gain a better understanding of the QS communication network (Fig. 6A), we collected and sorted the nine QS languages for 210 microbes at the genus level and show them as a heatmap in Fig. 6C. It has previously been reported that AHLs are only found in Proteobacteria56, that AIPs exist mostly in Firmicutes12, and that other QS languages are distributed inhomogeneously across genus-level microbes57; the QS communication network predicted from the QSHGM database agrees with these reports. Surprisingly, there are no highly similar distributions of QS languages within the same genus-level microbes. On the contrary, taking the distribution of QS languages in Actinobacteria as an example, the language distributions are quite different between its members (Fig. 6C, cyan). This suggests that the existence and evolution of autoinducer synthases in microbes might not have been strictly familial at the genus level, but are more likely to be related to a variety of factors, such as environmental factors and spatial distributions58-60. To sum up, these predicted patterns of distribution of QS languages among these microbes suggest the diversity of microbial communication languages, the complexity of cell-cell communication, and the redundancy of QS-based interactions among human gut microbiota.