Bacterial Strain
We used Escherichia coli K12 MG 1655 strains where the expression of the mutS gene is directly controlled by the arabinose promoter pBAD inserted in front of the mutS gene. In absence of arabinose, mutS is not expressed, leading to a higher spontaneous mutation rate due to the inactivation of the methyl-directed mismatch repair system (MMR,(45)). Additionally, our strain had a GFP marker located in the lac operon, which can be induced by IPTG (Isopropyl β-D-1-thiogalactopyranoside)
Experimental setup
Twelve bacterial strains were propagated on LB agar plates at 37°C for a total duration of 39 days. The strains were transferred on new agar plates every 3 days (Figure 5). An image of the colony was taken before transferring the strains to a new plate. The location of the sampling point of each transfer was chosen at random on the periphery of the colony. At each transfer, a sample containing about 100 million cells was collected from the colony front using a sterile pipette tip and resuspended in 100 µl 0.85% NaCl solution. About one million cells were then used to inoculate a new plate (Figure 5B). This expansion experiment on several plates aims at mimicking a continuous expansion for 39 day or 1650 generations (Figure 5C). We extracted DNA from six lines during each of the 13 transfers, and for six other lines, we extracted DNA at day 3, 12, 21, 30, and 39. We thus analysed a total of 108 DNA samples from the 12 lines (Figure 5A).
DNA extraction
After the range expansion experiment on agar, one million cells from the wave front were streaked out on an LB agar plate containing 0.5% arabinose and incubated for 24h at 37°C to isolate single clones. A single colony was dissolved in 100 μl dilution solution (0.85% NaCl) and 1 µl was transferred to a new LB agar plate containing 0.5% arabinose. The plate was then incubated for 24h at 37°C. Then, the entire colony was removed from the agar plate and resuspended in 1 ml dilution solution. Genomic DNA was extracted using the Wizard Genomic DNA Purification Kit (Promega) following the manufacturer protocol. The integrity of the DNA was checked by gel electrophoresis. The DNA concentration was determined by fluorometric quantification (Qubit 2.0).
Whole genome sequencing and variant calling
108 DNA samples of 12 lines were sequenced using a TruSeq DNA PCR-Free library (Illumina) on a HiSeq 3000 platform (Illumina), from which we obtained 100bp paired end reads for all samples. Trimmomatic 0.32 (46) was used to remove the adapter sequences from the reads and for quality trimming. Leading and trailing bases with quality below 3 were removed. The reads were scanned with a 4bp sliding window and cut if the average quality per base was below 15. Reads with a length below 36 were excluded from the analysis. Variants were identified using BRESEQ (version 0.27.2), a computational tool for analyzing short-read DNA data (47). BRESEQ uses Bowtie2 (Langmead, et al. 2009) to map reads to the Escherichia coli K12 MG1655 (NC_000913.3) reference genome. As a first step, it identifies potential new junctions between disjoint regions of the reference sequence using all available reads. BRESEQ then uses an empirical error model for base quality re-calibration considering the identity of the reference base, the identity of the mismatch base, the base position within the read, and the neighboring base identities. At each alignment position, BRESEQ calculates the posterior probability of a given nucleotide given the observed aligned reads. If the nucleotide with the highest posterior probability is different from the reference, BRESEQ records read alignment evidence. The top/bottom strand distribution of reads supporting the major base is compared to the top/bottom distribution of reads supporting the minor base by using a Fisher’s Exact Test to avoid false-positive polymorphism prediction due to sequencing-error hotspots in reads on one strand. A one-sided Kolmogorov-Smirnov test was used to test whether base quality scores supporting the minor mutational variants are suspiciously lower than the base quality scores supporting the major variant. We excluded two sample after analyzing the DNA sequences due to potential contaminations.
Estimation of dN/dS ratio
The number of synonymous and non-synonymous substitutions were computed in each line. The dN/dS ratio was then estimated by taking the expected number of synonymous and non-synonymous substitutions into account if all codon positions in the reference genome would have mutated. We used a bootstrap approach to test if the dN/dS ratio is significantly different from 1. dN/dS was computed using randomized data sets in which the mutations were randomly sampled with repetition among six types of non-synonymous and six types of synonymous mutations (four possible transition and two possible transversions).
Analysis of colony size and number of mutations
We determined for each time point (Figure 5A) the number of mutations that have accumulated in each of the 12 lines, as well as the corresponding colony size. After exclusion of one line due to contaminations we were left with 103 measurements of 11 lines between 3 and 39 days. We determined the change in colony size and the change in the number of mutations over time by fitting a mixed-effect linear model to the data. We fit a fixed effect slope to the data that describes the effects common to all lines, and the model also considers line-specific variability in the slope by including random effects bi for the intercept and slope for the i-th line:
where Xi and Zi are known fixed effect and random effect regressor matrices, εi is the within group error with a spherical Gaussian distribution, and Ψ is the variance-covariance matrix of the random effects.
Two types of determination coefficients (R2) can be calculated for mixed effect regression models. The marginal represent the variance explained by the fixed effects of the model, whereas the conditional represents the variance explained by the entire model (with both fixed and random effects). The r.squaredGLMM function of the R package MuMIn was used to calculate and . In the model for the change in colony size over time and the model for the change in number of mutations the addition of a random effect of the slope significantly improves the fit of the models compared to a model with only a random effect for the intercept (Likelihood ratio tests; for the colony size model: p-value= 0.0017; for the number of mutation model: p-value= < 0.0001).
Effect of mutations on the colony size
The difference in colony size (Δc) of two consecutive expansions on agar plates, each of these expansions lasting for three days, was calculated for all lines and the mutations that accumulated during this period were determined. Only non-synonymous, frameshift, and non-sense mutations were considered, and for each Δc, the number of mutations (M) in every gene was determined. M has the same number of rows as the change of colony size Δc and 888 columns, one for every gene that had at least one mutation during the experiment. We used a regression approach to model the change in colony size Δc with the number of mutations in the genes M:
Δc = M + ε
where ε is the vector of residuals.
To avoid overfitting due to the high dimensionality of M, ridge regression was used to estimate the effect of a mutation on colony size in a given gene. If a mutation in a gene has no effect on the colony size, ridge regression shrinks the coefficient close to zero. Positive coefficients indicate an increase of colony size and negative coefficients indicate a decrease. The shrinking of the parameters is controlled by the regularization parameter λ, whose value was chosen by 3-fold cross-validation using the cv.glmnet function of the glmnet package.
Gene ontology enrichment test
We tested if there was a signal of adaptation during different periods of the experiment by using a gene ontology (GO) enrichment analysis where we only used non-synonymous, frameshift and nonsense mutations in each time period. The test was performed with the topGO package for R (48) on the genes that were detected to have a positive coefficient in the ridge regression. The resulting list of genes was used separately to perform a Fisher’s exact test to determine significantly over-represented GO terms. The weight01 algorithms used in the topGo analysis iteratively removes the genes mapped to significant GO terms from higher level GO terms and the significance score of connected nodes are compared to detect the locally most significant terms in the GO graph by down-weighting genes in less significant neighbors. The GO enrichment was applied separately to the following time periods: days 0-12, days 12-21, days 21-30, and days 30-39