The dependency of previous whole-genome operon prediction methods on experimental and functional information and unavailability of such information in metagenomic data makes metagenonmic operon prediction a tricky task [44, 56–62]. We addressed these limitations via MetaRon, by accurately predicts metagenomic operons without the dependency on functional or experimental information.
MetaRon Implementation
Simulated Genomes
We started with the implementation of MetaRon pipeline on E. coli K-12 MG1655 raw reads simulated via Next-Generation Simulator for Metagenomics (NeSSM) [63]. The microbe is considered as the gold standard for operons. This implementation serves as a litmus test for the performance. The simulation of E. coli genome produced around one million paired end reads of 100 bp length at 20X depth. MetaRon assembled the raw reads via IDBA [64] into 82 scaftigs. The scaftigs with length less than or equal to 500 bp were removed. The remaining scaftigs contains 4,227 genes that were predicted using prodigal [41]. In the first step, MetaRon identified 822 co-directional proximal gene clusters (IGD < 601 bp), containing 2,955 genes. These gene clusters were named as proximons, since they were identified based on direction and intergenic space, as defined by proximon proposition [65–67]. The proximon cluster length range from binary (2 genes) to 32 genes, with no proximons of length 17, 21, 23, 24, 26, 27, 28 and 29 (Fig. 3).
Out of 822 proximons, majority of the clusters (32.9%) are binary while 19.7% and 11.8% proximons are three and four genes long, respectively. The remaining 35.5% of proximons are longer than four genes (Fig. 4). Introduction of a structure defining feature clearly refined the results from proximons to operons. Many genes that that were a part of the proximon cluster are removed by adding the promoter parameter hence, the number of genes in each cluster reduced, leaving behind co-directional closely packed genes that are under the control of a single promoter. This means, an increase in the percentage of binary operons, three and four gene operons but a decrease in clusters with length more than four genes. The proportion of operons with length 2–4 increased to 78% as compared to 64.5% of proximon clusters (Fig. 4). At this point, it is imperative to highlight that no Transcription Unit Boundary (TUB) is definedin the proximal gene clusters. This means that a proximon or a candidate operon might enclosemore than one operon or non-operonic genes. Therefore, we added upstream promoter as a more stringent and structure defining parameter to outline the transcription start and end for an operon (Fig. 1). Prediction of the upstream promoter in the proximal gene clusters results in the removal of non-operonic genes (false positives) and definition of TUBs. This resulted in a total of 828 operons containing 2,893 genes. The longest operon contained 16 genes [68–71]. In comparison with the operons from DOOR database [69, 71], MetaRon achieved a true positive rate of 97.8%. The percentage of binary settings in case of operons increases to 43.9% (364 operons) while the percentageof operon length 3 and 4 is 21.2% (176) and 13.2% (110), respectively (Fig. 4).
These results corroborate with the fact that in E. coli genome, the majority of the operons have binary organization [72, 73]. The percentage of binary gene clusters hold a significant role in accessing the operon predictions since, most of the operons in microbial genomes are binary [27]. An increase in the proportion of such operons in comparison with proximal gene clusters signifies the removal of false positives and improved sensitivity. About 21.7% of operonic clusters have length ranging between five and sixteen (Fig. 4). In order to test its applicability and accuracy, MetaRon was also implemented on raw reads simulated from a whole-genomes of E. coli MG1655, M. tuberculosis H37Rv and B. subtilis 168. The simulation of above mentioned 13,266,813 bp long genomes resulted in two million reads simulated at 15X depth via NeSSM [63]. The resultant 2,514 proximons encompassing 10,625 genes are identified from 232 scaftigs comprising 12,481 genes. The proximons range from 2 to 36 genes in length. In the proceeding step, 2,579 operons containing 8,749 genes are identified. The comparison with the DOOR database demonstrated the sensitivity, specificity, and accuracy of 93.7%, 75.5%, and 88.1%, respectively. Although, MetaRon achieved better performance than previous metagenomic operon prediction work, the performance is affected by reasons such as an operon divided between two scaftigs or the promoter prediction as well as the non-availability of reference. Never the less, the results achieved are encouraging enough to proceed with real data, which is more diverse and complex.
E. coli C20 draft genome operon prediction
We then implemented MetaRon on E. coli C20 draft genome isolated from the metagenome of chicken gut. MetaRon identified 4,544 genes from 4,640,940 bp long genome and resulted in 841 proximons and 946 operons containing 3,937 and 2,409 genes respectively. The longest operon by the number of genes reduced from 33 genes long proximon to 10 genes long operon while the percentage of binary operons significantly increased from 32% (268 proximons) to 71% (673 operons). In this case, MetaRon achieved sensitivity, specificity, and accuracy of 87%, 91%, and 88%, respectively [69, 71]. Majority of operons (68%) discretely mapped to a single reference operon while 20% of predicted operons have over one hit with the reference. Twelve per cent operons demonstrated unique configuration that has less than 50% match with the reference (Fig. 5). This is expected due to the fact that similar genomes could demonstrate variable operonic settings in different conditions [74–77].
Since metagenome data does not have a complete reference on which the raw reads could be mapped, so it is assembled into multiple contigs/scaftigs, rather than in one whole-genome; hencemultiple operonic configurations were observed (Fig. 6). Unlike the proximon proposition, where the majority of the proximons were mapped to more than one operon in a subset fashion, 66% of the operons identified via MetaRon matched precisely to one reference operon as a perfect match. About 8% of the operons show a subset configuration (Fig. 7). A subset configuration refers to an exact match with one or more extra genes in the reference (Fig. 6). While 4% of the predicted operons displayedcontrariwise formation known as a superset configuration, i.e., an operonis longer than the reference operon by one or moregenes (Fig. 7). The subset formations could be due to the distribution of an operon between two scaftigs or different transcription unit boundary (Fig. 6). Furthermore, there were operons that encompasses more than one reference operon in an exact or partial match. Such operonic settings are named as bridge-1, while the other way around is named as bridge-2. About 5% of the above-mentioned bridge settings are observed in the predicted operons. Bridge configurations could be due to altered TUB or the inability of the promoter prediction tool to identify the promoter.
Metagenomic data from various conditions demonstrates new microbial functions under different levels of stress and environmental stimulus [24]. Many unique operonic organizations are likely to appear as a response to new environmental stimuli. This leads to the formation of new or modified operonic configurations such as subsets, supersets or unique operons. In the case of E. coli C20, 17% of predicted operons have less than 50% or no match with the reference (Fig. 7). Such unique organizations may well carry precious insights about the microbial activity for a particular environment regarding bacterial products and pathways [24]. Such insights at metagenomic scale could be valuable in understanding disease condition, its prevention and possibly the cure as well.
Application to Type 2 Diabetes metagenomes
MetaRon was further implemented on whole-metagenomic raw reads from the gut of 145 Chinese individuals (74 Type 2 Diabetic (T2D), 71 controls) [39]. The two groups of individuals are further divided into four sub-groups in each category based on gender, weight and diabetic/non-diabetic (Table 1). MetaRon produced 3,868,389 operons containing 12,414,125 genes (Fig. 8). This makes up almost 50% of the total 23,280,123 genes. There could be similar operons in different samples, so for better organization and an operon catalogue was curated, which resulted in 1.23 million unique operons. The longest operon is 185 genes long. An average 61.3% of the predicted operons showed binary setting. The percentage was consistently high in all samples as shown in Fig. 8.