Pan-genome comparisons and matrix generation
A pan-genomic analysis underpinned the metabolic reconstruction effort by supplying extensive functional and sequence-based information (Figure 5). All clustering and pan-genomic analyses were performed on GenBank files containing gene and functional annotations from KBase using the Get_homologues package [58] which comes with a well-documented manual. Prior to clustering, exceptionally long annotations, which can arise through non-specific RAST annotations, were truncated as these become the file names of clusters generated by Get_homologues and cause the program to error when they exceed the maximum number of bytes for a Linux operating system. The analyses were performed on translated protein sequences with the OrthoMCL algorithm (OMCL) [70] with a granularity parameter of 1.5, minimum coverage of 75% and maximum E-value of 10–5. Given there were multiple closed genomes for two strains, the analysis was split into two stages. First an intra-species analysis was performed and representative strains for each species were selected for an inter-species analysis (Figure 5). Such an approach avoids generating clustering bias; likely to arise from the intra-species variability exceeding the inter-species variability. The availability of additional experimental data made P. acidipropionici ATCC 4875 the ideal representative strain for P. acidipropionici while P. acnes 6609 was chosen as the intra-species analysis suggested it was the most functionally robust and therefore most representative of the full metabolic functionality of P. acnes. Estimation of the core and pan-genome size was performed using the OMCL clusters with the random sampling implementation of the Tettelin [6] and Willenbrock [23] algorithms. Trees generated from Get_homologues were visualised using FigTree (available at http://tree.bio.ed.ac.uk/software/figtree/).
While orthologous clusters are good indicators of conserved functionality of a cluster and phylogenetic relationships, they were found to be poor representatives of functional capacity of the genus. Evidently with a small set of strains of differing evolutionary distance, the algorithms will cluster true homologues in separate groups based on small sequence differences. For example, robust analyses using the intersecting clusters of the OMCL, BDBH and COG algorithms [58] (Figure 5) excluded core metabolic clusters including functionalities from the Wood-Werkman cycle, the pentose phosphate pathway and the gluconeogenic malic enzyme (S9, Additional file 1). Using just the OMCL algorithm still obscured approximately 96 core functionalities by differential clustering; 73 of which could be traced back to probable splits in orthologous gene clusters of which half appear to be contributed to by the most evolutionary distant species, P. propionicum. Therefore, a functional-annotation of the pan-genomic matrix was generated to complement the orthologous matrix in a two-step process (S5, Additional file 4). We first utilised MATLAB to computationally combine entries with exact name matches. Secondly, the matrix was manually perused to correct entries which were similarly annotated but not exact matches due to a lack of annotation convention. This process was assisted by utilising EC numbers and primarily the MetaCyc database [71] to provide lists of synonymous enzymatic names. Little effort was spent curating hypothetical and putative proteins as these were expected to contribute negligibly to the functional understanding of Propionibacterium metabolism. Overall, the functional annotation collapsed the pan-genomic matrix by about two thirds while the number of core clusters remained roughly constant. Visualization of the orthologous and functional pan-genomic matrices was performed using Cluster 3.0 [72] and TreeView [73] (Figures 5 and 6, S1, Additional file 1.
Figure 5
Genome-scale metabolic model and Pan-GEM construction
The original annotations of the complete genomes were further processed through the KBase pipeline to draft-reconstructions (Figure 5). Due to the large number of P. acnes strains with little functional variability, suggested to be around 3–4% by the functional pan-genomic matrix, a model was only generated for P. acnes 6609. At the time of writing, it was not possible to either manually curate models using KBase or to upload a working, manually curated model into the framework which made it difficult to utilise any additional features of the pipeline. We instead utilised chemically defined media formulations from experiments or the literature to initially gap-fill reconstructions using the standard biomass template reaction; or complete media formulations where these were not available. A chemically defined media was developed in house for P. acidipropionici which included several trace elements, glucose as a carbon source, nitrate as a nitrogen source and phosphate as a source of phosphorous and the vitamins biotin, pantothenate, riboflavin, thiamin and vitamin B12.
A consensus biomass equation was next constructed by datamining the available literature. Most investigations have focussed on the unusual lipid and cell wall composition [74–76] of propionibacteria but other studies on the macromolecular composition of Propionibacterium were absent. We therefore composed a consensus biomass equation for Propionibacterium that captures the predominant composition of the genus based on current knowledge, while acknowledging that there will be significant variability between different species; for example, the peptidoglycan structure is variable in the genus. DNA composition was determined from the genome sequence and RNA composition was estimated based on the P. acidipropionici 4875 RNA sequencing data. Where unknown, the macromolecular composition and protein compositions were complemented with quantitative data from Streptomyces coelicolor [77]. After finalising a biomass equation (S10, Additional file 6), previously added non-essential gap-fills were removed.
The pan-GEM represents the complete metabolic functionality of the individual reconstructions generated for each species. To create the pan-GEM, the individual metabolic reactions from each model are consolidated into a single supra-model with conserved associated gene annotations (S3, Additional file 3).
Pan-genome guided genome-scale metabolic reconstruction
The draft reconstructions developed were of low quality, containing numerous incorrect annotations and lacking key central carbon metabolic reactions including reactions of the Wood-Werkman cycle, highlighting the shortcomings of utilising popular automated annotation pipelines for non-model organisms. A combined pan-genomic guided annotation approach was used to functionally correct the annotations by incorporating functional information contained within the models, pan-genomic information and available data and is summarised in Figure 5. Briefly, the metabolic models gave functional cues to guide the annotation, such as gap-fills, blocked reactions, differing metabolic capabilities between organisms or testing catabolic capacity for experimentally verified nutrient sources including phenotype array information. Metabolism was explored on a pathway basis using pathways defined in MetaCyc [71] or subsystems from RAST to guide the search for potentially missing functionalities. Functional cues also come from the literature which contains strain specific information for certain genes and enzymatic capabilities. Another primary source of annotation cues comes from genomic-context of the orthologous and functional matrices. The genomic-context benefits the annotation by enhancing resolution and allows the identification of inconsistent annotations, potentially missed or incorrect annotations when they are present or absent in a small number of species and provides additional evidence for an annotation; such as if a function is flanked by genetic elements with related functions in some strains or highly conserved genomic regions. Analysis of the functional orthologous matrix allows the identification of reactions which were not captured in the model due to pipeline limitations and helps identify incorrect annotations such as inconsistent numbers of subunits of a particular complex. Potentially absent or incorrectly annotated genes were then probed by identifying sequences from closely related strains and by using BLAST to compare these sequences against a local database. This database contained all modelled Propionibacterium protein sequences and genomes and could be used to query the propionibacteria pan-proteome or look for unannotated sequences in the pan-genome. Results are then validated against orthologous gene clusters and functionality is assigned on a cluster basis.
This manual curation approach was applied finitely and focussed on pathways of the central carbon metabolism, particularly around the central phosphoenolpyruvate-pyruvate node and eventually expanded to include the respiratory metabolism and more specific metabolic traits of Propionibacterium including polyphosphate and trehalose metabolism. Particular attention was also given to amino acid metabolism which can influence fermentation products [78]. By referencing the literature and the BRENDA database [79], known enzyme promiscuities that could fill functional gaps were identified, cofactor usage of reactions in the model were re-assessed and reaction reversibility was adjusted based on biochemical evidence and modelling results. For example, the irreversibility of the pyrophosphate generating reactions through the actions of intracellular pyrophosphatases was set in all models. For a brief guide of the manual curation of genome-scale metabolic models from the RAST framework, refer to S11 (Additional file 1).
Phenotype array data generation and integration
Phenotypic growth array data (S6, Additional file 5) for P. acidipropionici 4875 was generated and used to refine the model using an in-house gap-filling algorithm (manuscript in preparation). The phenotype array and methodology was provided by Biolog, Inc. and was performed in accordance with their Streptomyces protocol with the modification of 2 µg/mL of known essential Propionibacterium vitamins; thiamine, biotin, pantothenic acid, riboflavin and vitamin B12. All additives to the array were defined except the proprietary inoculating fluids. P. acidipropionici 4875 grew on 110 of the 189 carbon sources tested of which 91 could be mapped to compounds in the ModelSEED database. Transporters for these 91 compounds were then added to the P. acidipropionici 4875 model. Next, the complete database of ModelSEED reactions was used to test if each of these compounds could sustain growth and where not possible, they were discarded from the analysis (14). Essential reactions required for growth on any of the conditions were then identified and added to the analysis. Finally, the minimal number of reactions needed to sustain growth across all conditions was simultaneously computed using a mixed-integer linear programming formulation. A final round of manual curation was performed such that all reactions incorporated into the model based on this work were either supported by genes, enzyme promiscuity information, were essential or otherwise the only reasonable candidate and results were propagated across other reconstructions where supported by genetic or biochemical evidence. The degradation pathways of phenylalanine and two related aromatic compounds completely lacked any genetic support for either of the currently known pathways for degradation and were therefore excluded.
Clustergram generation
Comparison of the biosynthetic and catabolic potential of amino acids in propionibacteria was performed by generating clustergrams. Reactions associated with the catabolism or anabolism of each amino acid were grouped and their presence or absence scored in each reconstruction. Differences between strains were quantified using the Jaccardian similarity coefficient. Details of these pathways, brief functional assessments and Jaccard score computation can be found in S12, Additional file 7. The Jaccardian similarity coefficient was utilised as a distance metric to generate the clustergrams in MATLAB.
Bioreactor cultures and ‘omics analysis
Propionibacterium acidipropionici ATCC 4875 was obtained from the American Type Culture Collection. Cells stored at –80 ˚C, were used to inoculate serum bottles containing 50 mL PAM with 40 g/L of the relevant carbon source. Cultures were incubated statically at 32 ˚C for 24 hours and used to inoculate Applikon Ez-Control bioreactors containing PAM and 100g/L of glucose, sucrose or glycerol. All experiments were performed in duplicates. Reactors were maintained at 32˚C, pH 6.5 with 5 M NaOH, agitated at 200 rpm and the head space was degassed with nitrogen to maintain anaerobic conditions throughout the fermentation. Regular samples were taken to measure optical density using a Biochrom Libra S12 spectrophotometer and for organic acid analysis. Cells were harvested for RNA and protein extraction during mid-exponential phase. Total RNA and mRNA enrichment, sequence alignment and differential expression analysis was performed as described in [80]. For proteomics,cell pellets were lysed and trypsin digestion was performed as previously described in [81] andSWATH based quantification was performed as described in [82]. The differential abundance of proteins with more than 2 peptides with a 95% confidence score and a false discovery rate less than 1% was assessed using the LIMMA package. Extracellular organic acids, carbohydrates and alcohols were quantified by ion-exclusion chromatography as described in [27].
PFOR Knockout
The PFOR encoded in nifJ1 gene (PFREUD_RS00925, CDS.199 this study) was knocked out in P. freudenreichii subsp. shermanii using CRISPR/Cas9 by interruption of the gene with gfpUV. Briefly, we assembled plasmid pPAC_Cas9 using pRGO1 as the plasmid backbone and introducing the lacZ-MCS region, an erythromycin resistance gene (ermE) and a GC-optimised Cas9 under the control of promoter PermE. Gibson assembly was then used to create the knockout vector, pCas9_nifJ (Figure 6), by adding homologous arms designed to span 1 kb up and downstream of a PAM region identified in the nifJ gene, the gfpUV sequence and gRNA under the control of the P130 promoter from propionibacteria. The homologous arms and gfpUV were PCR amplified from P. freudenreichii subsp. shermanii and the pBRPprp_gfpuv plasmid respectively. Primer sequences and the 20-nt region for the gRNA are shown in Table 5. All cloning was performed in E. coli DH5α (Bioline). PCR reactions were performed using Phusion Taq (NEB). Introduction of plasmid pCas9_nifJ into P. freudenreichii subsp. shermanii was performed by electroporation as previously described [28]. After 3 h recovery on PAM media (sucrose, 40 g/L; trypticase soy, 5 g/L; yeast extract, 10 g/L; K2HPO4, 0.05 g/L; MnSO4, 0.05 g/L), cells were plated on NLB (sodium lactate, 10 g/L; trypticase soy, 10 g/L; yeast extract, 10 g/L; glucose, 10 g/L) agar and incubated in an anaerobic jar at 30 C. After 10 days, colonies were identified and tested by colony PCR to confirm the successful knockout. Positive ∆nifJ colonies were then grown on NLB liquid media for 72 h.
Figure 6