Evaluation of assembly reproducibility according to sequencing quality using simulated data
A key requirement for sharing data between interoperable surveillance systems is to evaluate the repeatability and reproducibility of analysis and to propose quality criteria for data inclusion. The assembly tools chosen (SPAdes, unicycler and Shovill) were selected because they have been frequently used in recently published workflows dedicated to bacterial WGS. We evaluated the impact of read quality on sequencing simulations for 27 bacterial species, and observed that poor data quality (Q < 40) decreases the quality of assembly: Assemblies were impossible to draft with Shovill, because the tool did not accept input data, or were shorter and more fragmented with SPAdes and Unicycler (Supplementary data S1). For Vibrio parahaemolyticus, the maximum number of contigs was 80 with high-quality data (Q > 40) but increased to 120 with poor-quality data. For some species, such as Bacillus cereus, Clostridium perfringens, Taylorella Mycobacterium tuberculosis, and Ralstonia solanacearum, some genome parts were even missing from the final assembly obtained with a poor read quality (Supplementary data S2), in position 0 Mb for Bacillus cereus, 0.1 Mb for Clostridium perfringens, 4.0 Mb for Mycobacterium tuberculosis, and 2.8 Mb for Ralstonia solanacearum.
Furthermore, the poor quality of reads also increased genome misassemblies compared with results obtained with a high read quality. Indeed, in Klebsiella aerogenes, at a depth of 75x, the maximum percentage of misassemblies was 40% with poor-quality reads whereas the figure for high-quality reads could drop as far as 0%. For example, in Mycobacterium bovis, there were 20% of misassemblies with poor-quality reads vs. 0% with high-quality reads; in Neisseria meningitides these figures were 40% (poor quality) vs. 20% (high quality); in Staphylococcus argenteus they were 20% (poor quality) vs. 0%; and in Bacillus cereus, 20% (poor quality) vs. 7%. For Clostridium perfringens, the rate of misassemblies obtained with a poor read quality could reach 60% in some assemblies. For other species such as Campylobacter spp., Listeria monocytogenes, Escherichia coli or Vibrio cholerae, assembly results appeared to be less affected by a poor read quality (Supplementary data S1).
When we compared the impact of various sequencing depths, we observed an optimal threshold at 75x. At this value, parameters representing high-quality assembly are maximized, i.e., the number of contigs and misassemblies decrease, and both N50 and total length increase. Mahn-Whitney tests used to compare the four-parameter distribution obtained at different sequencing depths were significant (Table S4). Results with 150x and 100x were identical. Comparing 25x with 100x, contig number distributions were significantly different for 10/27 species, N50 distributions significantly different for 21/27 species, misassemblies for 25/27 species, and largest contig for 16/27 species. For 50x, no difference was observed in contig number, N50 and largest contig, while misassembly distributions were different for 10/27 species. For 75x, no difference was observed in contig number, N50, and largest contig, while misassembly distributions were different for 6/27 species. Therefore, for the subsequent analyses, we present results derived from high-quality reads at a depth of 75x (Supplementary data S3).
Comparison of assembly tools with a high read quality and sufficient depth using simulated data
To determine which tool performs better in genome assembly, SPAdes, Shovill and Unicycler were compared using simulated sequencing data with a high quality and mean depth of 75x. Our results indicated that assembly repeatability does not depend on the tools used but instead appears to be genome-dependent. An alignment of the generated assemblies to the reference used for the sequencing simulation revealed that both Shovill and Unicycler performed better for Listeria monocytogenes and Ralstonia solanacearum than for most the 27 bacterial species (Fig. 2A). Interestingly, these tools fragmented the genome into similar genomic regions, which seem to correlate with variations in GC content across the genome. However, assembling the genome of Mycobacterium bovis and Xylella fastidiosa with the same assembly tool led to different results (Fig. 2B). Specifically, for these two species, assembly replicates obtained from the same simulated dataset produced identical contigs, as was observed for all studied genomes in our dataset, but for these two species, the assembly differed for each sequencing simulation dataset (i.e., read simulations obtained from the same genome).
Impact of assembly tools on cgMLST profiles using simulated data
Once the optimum quality criteria for sequencing were determined, the impact of cgMLST analyses was evaluated for 21 species for which a cgMLST scheme was available. The cgMLST profiles obtained from high-quality sequencing (i.e., Q > 40) with sufficient depth (i.e., depth = 75X) classified bacterial species into two categories based on the allelic difference rates observed between the reference genome and the assemblies obtained (Fig. 3). Results from SPAdes consistently exhibited higher assembly fragmentation and misassemblies than those obtained with Shovill and Unicycler, and are not therefore presented here. The first category (group 1) comprised 14 out of 21 bacterial species that had less than 5% of errors between the reference and the assembly obtained. For group 1, results suggested that the choice of assembler should vary according to the species studied (Fig. 3A). Indeed, for Escherichia. coli, Mycobacterium tuberculosis, Vibrio cholerae, and Taylorella equigenitalis, a significant difference (p-value < 5% for Mann-Whitney test) was observed between Shovill and Unicycler results, suggesting that Shovill gave cgMLST profiles closest to the reference. However, for Neisseria meningitidis and Leptospira interrogans, the allelic profiles were closest to the reference when Unicycler was used, although no significant difference was observed when checked with the Mann-Whiney test.
The second category (group 2) comprised 7 out of 21 bacterial species for which the number of allelic differences between the reference and the assembly obtained was greater than 5% (Fig. 3B), with a maximum of 30% for Salmonella enterica. Within group 2, few differences were observed between the results obtained from Shovill and Unicycler assemblies, suggesting that the choice of assembly tool may be negligible compared with the intrinsic genome composition, except for Campylobacter spp. for which a significant difference was observed between distribution results from the two tools.
Comparison of cgMLST profiles obtained with different sequencing depths using simulated data
Related strains were identified by clustering cgMLST profiles obtained with different data quality and depth combinations. In open-source surveillance systems or applications, various data qualities can be shared with the science community with diverse internal sequencing capacities and/or quality thresholds. To evaluate the impact of various sequencing depths on cgMLST results, we compared simulated sequencing data associated with mean depths of 25x, 50x, and 75x. The number of allelic differences between reference cgMLST profiles and cgMLST profiles obtained significantly increased for assemblies with a sequencing depth less than 75x for all species belonging to group 1 (Fig. 4a). Only four out of 21 bacterial species, all belonging to group 2 previously described (i.e., greater than 5%), appeared not to be impacted by the quality of sequenced data: Bacillus cereus, Bacillus cytotoxicus, Bacillus thuringiensis, and Vibrio parahaemolyticus (Fig. 4b), as no significant difference was observed. However, for other species—regardless of whether they belong to the first or second group previously described—the number of allelic differences was significantly higher with poor depth (Q < 40) using simulated sequencing data. These results underscored the importance of performing genomic typing on harmonized, high-quality data with a sufficient sequencing depth to investigate outbreaks.
Confirmation of reproducibility and repeatability when sequencing real data
To confirm the poor repeatability and reproducibility of cgMLST results obtained using simulated sequencing data and evaluate the impact on real data, we analyzed biological replicates of bacterial strains from six species. The cgMLST profiles were computed for each biological replicate to evaluate reproducibility, and bioinformatics analyses were performed in triplicate to investigate repeatability.
The cgMLST profiles obtained using real data showed that the results were repeatable between analyses, as also observed with simulated sequencing. Indeed, the cgMLST profiles resulting from SPAdes and Unicycler assemblies were comparable between each replicate, indicating 100% repeatability, as no distance was observed between assemblies obtained from the same raw data (Fig. 5). However, poor reproducibility was observed between the biological replicates, with distances observed between the same strains for which raw data were provided from two independent extractions. This finding suggests that the wet lab part has a major impact on cgMLST profiles, despite using the same DNA extraction protocol for Salmonella enterica, Staphylococcus aureus, and Xylella fastidiosa. Indeed, only four out of 28 strains had identical profile results with Unicycler. With Shovill, repeatability seemed to be dependent on the species. For instance, for Listeria monocytogenes all analyses were 100% identical, whereas for Staphylococcus aureus, Vibrio parahaemolyticus, and Xylella fastidiosa the strains had different cgMLST profiles resulting from distinct assemblies. For Salmonella enterica and Bacillus thuringiensis, one and two strains, respectively, gave different cgMLST profiles between analyses, but only one gene was systematically affected.
The cgMLST profiles for biological replicates were found to be identical for eight out of 28 analyzed strains (Fig. 5). These eight strains belong to Bacillus thuringiensis (two out of five strains), Listeria monocytogenes (four out of five strains), Vibrio parahaemolyticus (one out of four strains), and Salmonella enterica (one out of five strains). This level of reproducibility was mainly observed for the results generated by SPAdes and Unicycler, although only the Unicycler results maximized the completeness of the cgMLST scheme, i.e., more genes in the cgMLST scheme were found after Unicycler assembly. Conversely, with Shovill, only five strains had the same cgMLST profiles for biological replicates (one Bacillus thuringiensis, and four Listeria monocytogenes), and only four strains gave profiles that were identical to the Unicycler results (one Bacillus thuringiensis and three Listeria monocytogenes).
The number of allelic differences between biological replicates was found to be elevated (22 allelic differences between two Listeria monocytogenes replicates or 184 between two Staphylococcus aureus replicates), suggesting potential ambiguity for closely-related strains (Fig. 5). Depending on the species and assembly tools used, the number of allelic differences between biological replicates varied significantly, ranging from 10 allelic differences for Bacillus thuringiensis, to 138 for Salmonella enterica with Unicycler. Results obtained for two closely-related strains of Xylella fastidiosa subsp. multiplex, both belonging to ST6 based on the MLST of seven housekeeping genes (Amandine Cunty, personal communication), were mixed for cgMLST results, whereas they were found to be distinguishable in SNP analyses (data not shown). These results suggested that for outbreak investigations using this method, it may be challenging to discriminate the strain responsible for the outbreak and consequently determine its source.