The worldwide incidence of colorectal cancer (CRC) has shown a concerning upward trend, particularly in Japan and certain other regions1. The rise in incidence is commonly attributed to an increasingly westernized lifestyle, but the prevalence in Japan surpasses that in the US and Europe, indicating the existence of other contributing factors. Furthermore, the incidence of early-onset CRC, diagnosed before the age of 50, has been steadily increasing worldwide since the 1990s2,3.
One crucial aspect of CRC is its direct connection to the gut microbiota4. Enteric bacteria such as Escherichia coli containing the polyketide synthase (pks) operon5 and enterotoxigenic Bacteroides fragilis6 have been proposed as potential catalysts for CRC initiation, attributed to DNA damage and oxidative stress. Bacteria such as Fusobacterium nucleatum, Peptostreptococcus stomatis, and Parvimonas micra7,8,9, which are oral pathogens, have been found in the stools of CRC patients and may aid in the diagnosis of CRC.
Colibactin, a genotoxin that induces DNA interstrand cross-linking10 and double-strand breaks11, is synthesized by E. coli and other enteric bacteria containing pks islands12. Colibactin-induced DNA damage produces tumor mutational signatures such as SBS88 and ID18, which aggregate somatic DNA mutations in CRC13,14,15.
Identification of CRC subtypes based on gut microbiota composition
A total of 138 treatment-naïve patients diagnosed with histologically confirmed CRC were enrolled (Extended Data Table 1). This study did not include cases of hereditary CRC. Excised CRC tissues were fresh-frozen, and lymphocyte DNA and paired stool samples were also acquired. Tumor tissue and lymphocyte DNA were subjected to whole-genome sequencing (WGS), while stool samples were subjected to whole-genome metagenomic sequencing (WGMS). Healthy controls (HCs) were defined as individuals who underwent colonoscopy for reasons such as positive fecal occult blood, but had no significant findings. The data on HCs were collected from a sample of 146 individuals previously studied (Extended Data Table 2)9,16. The Brinkman Index was significantly higher in the CRC patients compared to HCs (P = 1.7 × 10-5) (Extended Data Table 3).
Using WGMS data, we utilized the SHAP (SHapley Additive exPlanations) explainable AI framework, in accordance with our previous report16. P. stomatis, F. nucleatum, Dialister pneumosintes, and P. micra were identified as the predominant bacteria differentiating CRC patients from HCs (Fig. 1A). The number of clusters was selected using the elbow method (Extended Data Figure 1). In Fig. 1B, we colored the points on the scatter plot according to the predicted CRC probability obtained from the classifier output. Principal component analysis (PCA) identified four variable clusters in all CRC patients (Fig. 1C). The clusters were categorized as Subtypes 1-4, and the probability of CRC was compared for each subtype. Subtype 2 had the highest CRC probability (P = 1.0 × 10-10, Wilcoxon rank sum test, one-sided), while Subtype 1 had the lowest (P = 9.1 × 10-15, Wilcoxon rank sum test, one-sided). F. nucleatum and P. stomatis significantly influenced Subtype 2, as indicated by a mean SHAP value of approximately 0.03 (Fig. 1D).In contrast, Subtype 1 was distinguished by the absence of specific bacterial species, as none of the bacterial species had a mean SHAP value greater than 0.01. P. stomatis and F. nucleatum were found to be highly distinctive bacterial species in Subtype 3 and Subtype 4, respectively, and D. pneumosintes was the second most significant bacterial species in both subtypes.
We applied this model to our cohort from the previous paper (Stage 1/2/3/4, n = 180)16 as the Validation cohort and subsequently partitioned it into four groups (Extended Data Figure 2). The Kaplan-Meier curves for overall survival indicated that Subtype 3 had a significantly worse prognosis than the other subtypes in both the Discovery (P = 0.034, log-rank test) (Fig. 1E) and Validation cohorts (P = 0.0034, log-rank test) (Fig. 1F).
Differentiation of clinical characteristics and genomic abnormalities by microbiome-based subtypes
We analyzed clinical features and genomic abnormalities using WGS and transcriptome sequencing. The WGS analysis showed that Subtype 2 had significantly more non-human bacteria reads than the other subtypes, indicating a higher level of microbiota in Subtype 2 (Extended Data Figure 3). Fig. 2A shows detailed clinicopathological information (Extended Data Table 4) and genomic alterations for 138 patients. Alcohol history was significantly more frequent in Subtype 1 than in other subtypes (P = 0.039, Fisher's exact test). There was a statistically significant decrease in smoking history within Subtype 2 compared to the other subtypes (P = 0.034). The patients in Subtype 3 had a younger age (median, 55 years old) compared to the patients in other subtypes (P = 0.029, Wilcoxon rank sum test, one-sided), and the patients in Subtype 2 were older (median, 65 years old) (P = 0.034, Wilcoxon rank sum test, one-sided) (Figure 2B, Extended Data Table 5).
Subtype 2 was characterized by a significant enrichment of ARID1A, B2M, and PTEN alterations (P = 0.0071, 0.037, and 0.037, OR = 5.05 [95%CI: 1.55-18.13], 5.06 [95%CI: 1.12-23.00] and 5.06 [95%CI: 1.12-23.00], respectively, Fisher’s exact test), as well as a hypermutated phenotype (more than 500 non-silent SNVs [15 SNVs/Mb] or 100 coding indels [3 indels/Mb]) (P = 0.044, OR = 3.72 [95%CI: 1.04-13.03], Fisher’s exact test) in CRCs (Fig. 2A, Table 1). Moreover, Subtype 2 was found to exhibit a high degree of similarity in characteristics to the consensus molecular subtype (CMS)17-based CRC classification CMS1 (Extended Data Table 6).
We then analyzed somatic copy number alterations within the four CRC subtypes. In Subtype 1, the occurrence of amplifications was significantly more frequent than in other subtypes (P = 0.037, Wilcoxon rank sum test, one-sided) (Extended Data Figure 4).
Analysis of mutational signatures and their correlation with gut microbiome
The WGS data from 138 CRC cases were subjected to mutational signature analysis. By decomposing single base substitution (SBS) and short insertion and deletion (ID), we identified 16 SBS signatures, including two novel ones (Extended Data Figure 5A, 5B), and five ID signatures (Extended Data Figure 6A, 6B). We observed a significant increase in the prevalence of specific mutational signatures associated with defective DNA mismatch repair (dMMR) (i.e., SBS44 and ID2) in Subtype 2 compared to other subtypes (P = 0.013 and 0.047, respectively, Wilcoxon rank sum test) (Extended Data Table 7).
The colibactin signatures (i.e., SBS88 and ID18) showed a significant correlation with the occurrence of colibactin-specific motifs (Fig. 3A, 3B). Surprisingly, the presence of the colibactin-related mutational signatures (SBS88 and/or ID18) had a significant impact on the occurrence of CRC mutations in this Japanese dataset, specifically in 97 out of 138 patients (70.3%) (Fig. 3C, Extended Data Figures 5C and 6C). Conversely, the hypermutated phenotype of CRC exhibited a limited number of colibactin signatures(Extended Data Table 8). Within this cohort, two of three patients diagnosed with Stage 0 CRC (carcinoma in situ) displayed the colibactin signature (i.e., ID18).The incidence of colibactin-related mutational signatures among CRC patients is remarkably higher than in previous reports from other countries (approximately 5-10%)12,18,15. In younger patients (age ≤ 40) and in Subtype 3, the colibactin signature showed a significantly higher contribution (P = 0.0033 and 0.022, respectively, Wilcoxon rank sum test) (Extended Data Table 9 and Fig. 3D). Furthermore, there was a significant increase in the contribution in patients with left colon CRC, Body mass index (BMI) < 30, CRC stage ≥ 3, and rectal cancer (P = 0.0018, 0.035, 0.042, and 0.0030, respectively, Wilcoxon rank sum test) (Extended Data Tables 10-13).
A recent report showcased a comprehensive analysis of CRC through WGS as part of the Genomics England project15. According to the report, SBS93 and ID14 have been identified as the fourth most frequently observed SBS in CRC. It also shows that SBS93 was more common in young-onset CRC. Interestingly, our study demonstrated that a significant mutual exclusivity was observed between the colibactin signatures (SBS88 and ID18) and ID14 and SBS93 (P = 0.0012 and 0.025, respectively, based on Fisher's exact test) (Extended Data Table 14). It has been suggested that the etiological factors responsible for CRC characterized by the SBS93 mutational signature differ from those linked to the SBS88 mutational signature.
When hotspot mutations seen in three or more cases were investigated, one APC hotspot mutation displayed the characteristic SBS88 mutational signature and possessed a colibactin-specific motif (Fig. 3E). The hotspot mutation is located within intron 8 of APC, resulting in a novel AG splice site. This leads to a 7 bp insertion and a frameshift in APC (Fig. 3E). Out of 138 patients, 7 (5.1%) had this hotspot mutation, and of these 7 patients, 6 had two hits, showing another inactivating mutation or loss of heterogeneity (LOH) of APC(Extended Data Table 15). The data indicate a significant increase in the contribution of SBS88 in these seven patients compared to other patients (P = 0.0062) (Fig. 3F), providing strong evidence that SBS88 played a role in the formation of these APC hotspot mutations. In addition, 15 other patients exhibited non-silent mutations in significantly mutated genes that matched the colibactin signature mutation pattern (i.e., SBS88 and ID18), which included 10 APC frameshift deletions (Extended Data Table 16). Among the ten APC deletions, 5 exhibited the colibactin-specific ID motif with two adenines and three were hotspot deletions (Fig. 3G). Analysis of the correlation between the colibactin signature and patients with each gene mutation in non-hypermutated CRC revealed an inverse relationship between KRAS mutations and the colibactin signature (i.e., SBS88 and ID18) (Extended Data Table 17).
Using WGMS, we found the colibactin-producing gene cluster in 26 (19.3%) of the fecal samples (Extended Data Figure 7A). No correlation was found between the amount of the colibactin-producing gene cluster in stool and the proportion of colibactin-related mutational signatures (Extended Data Figure 7B, 7C) or the colibactin gene (clbP) copy number in tumor tissue (Extended Data Figure 8A, 8B). We confirmed a high correlation between the copy number of the colibactin gene (i.e., clbP) in stool estimated through qPCR and that estimated through WGMS (Extended Data Figure 7D). The presence of the clbP gene, which encodes the atypical peptidase ClbP that contributes to colibactin genotoxin synthesis19, in CRC tissues was examined by quantitative PCR20,21,22. Notably, the clbP gene was found in all 138 samples of CRC tissue, as well as in all 12 non-cancerous colon mucosa tissues that were investigated (Extended Data Table 18). Among the tumor tissue samples, the top three cases with the highest abundance of clbP gene highlighted a substantial contribution of the colibactin-related mutational signature (Extended Data Figure 8C). However, most cases with a notable colibactin-related mutational signature showed low abundance of clbP in the tumor tissue (Extended Data Figure 8C-8E), suggesting that evaluation of the prevalence of colibactin-producing genes in a single sample may not accurately reflect the cumulative colibactin-related mutational signature.
Timing of colibactin signatures and their association with early-onset CRC
We analyzed the distribution of early and late clonal and subclonal mutations associated with specific mutational signatures by employing the methodology outlined by Gerstung et al 23. Due to variations in the ratio of early/late clonal and clonal/subclonal mutations among different mutational signatures, the analysis revealed that colibactin-related mutational signatures demonstrate a greater prevalence of early clonal mutations than late clonal and subclonal mutations (Fig. 4A, Extended Data Figure 9). Among the 106 cases exhibiting the early clonal SBS88 signature, there was an inverse correlation between the contribution of SBS88 and age (correlation coefficient = -0.41, P = 1.1 × 10-5) (Fig. 4B). The 30 patients with the highest rate of early clonal SBS88 mutations were significantly younger than the other patients (P = 2.9 × 10-4) (Fig. 4C).
Using somatic mutations and copy number (CN) variations, we were able to estimate the relative timing of three types of copy number gains: 1, copy number neutral loss of heterozygosity (CNN-LOH)/ loss + gain (N:0); 2, mono-allelic gains (N:1); 3, bi-allelic gains (N:M)23. Subtype 3 demonstrated an earlier occurrence of CNN-LOH/loss + gain than the other subtypes (P = 0.030, Wilcoxon rank sum test, one-sided) (Fig. 4D).Nevertheless, no significant association was found between mono-allelic gains and bi-allelic gains (Extended Data Figure 10). Additionally, Subtype 2 displayed a delayed occurrence of CNN-LOH/loss + gain compared to other subtypes (P = 0.036, Wilcoxon rank sum test, one-sided) (Fig. 4D), irrespective of the patient's age.
An in-depth case study of a CRC patient in Subtype 3 highlighted the significant involvement of colibactin signature SBS88 (approximately 10%) in the patient's disease progression. Fig. 4E displays the findings of CM099, a 54-year-old female with Stage II left-sided CRC, a non-smoker, with a non-hypermutated phenotype. This patient showed APC hotspot splice site mutation that matched the SBS88 signature and the colibactin-specific motif with early timing (mean of relative timing = 0.37 [95%CI: 0.00-0.68]).The timing of the APC hotspot mutation and 5q loss was calculated, as described in Methods, to have occurred at around 20 [95%CI: 0.00-36.72] years old. Nearly simultaneously, the case experienced TP53 (nonsense), NOTCH1 (missense), and chr17p loss, succeeded by whole genome duplication (WGD) with a mean relative timing of 0.76 [95%CI: 0.75-0.76] (around 41 years old). Additionally, the case exhibited various chromosomal gains and losses (Extended Data Tables 19, 20).
Immune environment in colibactin-associated CRC
The gut microbiota has an impact on both the local gut immune environment and the broader systemic immune system. We investigated the correlation between microbiome-based subtypes and immune cell composition as determined by CIBERSORTx, and no significant differences were observed in the absolute score or proportions of inflammatory cell types among the subtypes (Fig. 5A).
The integration of tumor mutation burden (TMB) and T cell-inflamed gene expression profile (GEP) score24 is useful for classifying tumor immune environments. In the TMBhigh/GEPhigh (immune-hot) category, the number of patients in Subtype 3 was significantly lower than in other subtypes (P = 0.0098, OR = 0.23, Fisher's exact test, one-sided) (Fig. 5B). Moreover, when hypermutated phenotype CRC was excluded from the analysis, patients with a higher ID18 percentage demonstrated significantly more TMBhigh/GEPlow compared to other cases (P = 0.013, OR = 3.3) (Fig. 5C).
A notable correlation was identified between colibactin-related mutational signatures (i.e., either ID18 or SBS88) and immune cells; that is, a higher proportion of ID18 was significantly correlated with an increase in the proportion of regulatory T cells (P = 0.035, Wilcoxon rank sum test, one-sided) (Fig. 5D, Extended Data Table 21). Immunohistochemical staining confirmed the expression of FOXP3 in FFPE sections of CRCs, specifically those with high (CM040) and low (CM053) colibactin signatures (Fig. 5E, 5F). A higher proportion of SBS88 was associated with a reduction in the number of resting CD4-positive memory T cells (P = 0.014) and a lower absolute score (P = 0.017) (Extended Data Table 22). The results suggest that CRCs with higher colibactin-related mutational signatures are located in an immunosuppressive microenvironment.