Descriptive statistics and Evaluating the correlation between immune-infiltration status with clinical characteristics
Patients’ clinical characteristics in the 4 studies were detailed in Table 1. Flowchart outlining the research process was showed in Figure 1. Variable immune infiltration pattern was analyzed by ssGSEA method, and the results were set out in Figure 2a. Using the unsupervised clustering method, the samples were divided into two subgroups—termed high immune-infiltration group (n=48), low immune-infiltration group (n=159). By the chi-square test, we found that patients with high immune-infiltration status had a higher probability to achieve the pCR (p=0.022). Nevertheless, in ER-HER2+ breast cancer, no correlation between the immune infiltration status and pCR was found (p=0.24).
Comparing the TILs subpopulations between the pCR and non-pCR group by the Kruskal-Wallis test, we found that B cells, cytotoxic cells, neutrophils and CD56+ dim natural killer cells were significantly higher in the pCR group (Figure 2b). On the contrast, a significant decrease of eosinophils was observed in the non-pCR group. Figure 2c showed an increase of almost all kinds of immune cells in the high immune-infiltration group compared with the low group except for type-2 T helper cells. We also investigated the predictive value of TILs subpopulations using the univariate logistic regression, revealing that B cells, cytotoxic cells, T cells, eosinophils, neutrophils, CD56 dim natural killer cells may serve as independent biomarkers to predict the efficacy of NAT. Results were presented in Table 2 .The interactive correlations among the TILs were presented in a correlation heatmap (Figure 2d). Using the hierarchical cluster analysis, the TIICs are clustered into 3 groups. Interestingly, the TILs subpopulations which were positively related to pCR were enriched into the second cluster, while the eosinophils in the first cluster, were negatively related to NK cells and cytotoxic cells obviously.
Exploring the potential pathways in differential immune infiltration patterns
To better understand the intrinsic mechanisms of different immune infiltration status, a total of 80 genes were screened out according to the immune infiltration subtypes, including 59 up-regulated genes and 31 down-regulated genes. The results were described as a volcano plot in Figure 3a.
We analyzed the cellular component, biological process and molecular function sections in GO analysis setting the threshold of p value<0.05. As is shown in Figure 3b, given the results of GO analysis, DEGs were significantly enriched in adaptive immune response, positive regulation of chemokine production, cytokine receptor binding and activity and so on. KEGG analysis indicated potential active pathways, such as NF−kappa B signaling pathway, natural killer cell mediated cytotoxicity, cytokine−cytokine receptor interaction. The enrichment plots of most significant pathways in KEGG pathways were visualized in Figure 3c, 3d.
Genes modules identified by WGCNA and functional enrichment analysis
The top 25% genes with the greatest variance were used to construct the co-expression networks. All the genes included passed the goodSamplesGenes test and 2 samples were removed as the outlier samples. 205 samples and 3072 genes were finally used to construct the networks. The correlations of the different soft threshold with the scale independence and mean connectivity were plotted in Figure 4a. After observing the network results by using different soft threshold (3, 4, 5), we finally determined to set soft threshold at 3 to ensure optimal module connectivity and scale independence. A total of 12 gene modules were identified by WGCNA, 2 modules with z-scores<10 were deleted in the preservation test (Figure 4b, 4c). Supplementary table 1 showed the genes in the 12 modules respectively. Pearson test was applied to analyze the association between the gene modules expression and clinical traits, and a heatmap (Figure 4d)was plotted to describe the results. Interestingly, the magenta module (no. of genes in magenta module=69)and brown module (no. of genes in brown module=277) were related to both the pCR status and immune infiltration status. Venn plot (Figure 3e)presented that 61 genes were overlapped between the genes in above two modules and DEGs according to immune infiltration status. By the gene functional enrichment analysis, the pink, brown and magenta modules were related to pCR, which were enriched in leukocyte activation, B cell receptor signaling pathway, nuclear signaling by ERBB4, respectively. Except for brown module and magenta module, the tan module, turquoise module, red module, green module, yellow module and greenyellow modules were associated with immune infiltration status, which were correlated with GTP hydrolysis and joining of the 60S ribosomal subunit, PPAR signaling pathway, packaging of telomere ends cell cycle, mitotic, interferon alpha/beta signaling, metabolism of lipids and lipoproteins.The functions enriched of the 10 modules were detailed in Supplementary table 2. Two scatterplots of Gene Significance vs. Module Membership of the magenta and brown modules were set out respectively in (Figure 4e).
Screening out genes linearly or non-linearly associated with pCR and optimizing the model
Using univariate logistic analysis, we filtered 23 genes significantly related to pCR in the brown module (p<0.01) and 7 genes in magenta module (p<0.05). The results of univariate logistic analysis were detailed in Table 3. After adding the ER status, age, pretreatment T stage in multivariate logistic model, five genes, which were CD72, PTGDS, CYTIP, CCL5, PAX5 and patients’ ER status were proved to influence the probability to achieve the pCR synergistically, detailed in table 4. Next, we constructed a generalized linear model consisted of these 6 variables, which was validated its predictive accuracy and specificity in the external validation cohort (GSE66305, n=88). AUC in ROC was 0.618 (Figure 6a). Secondly, univariate spline regression model was employed to identify the genes significantly related to pCR non-linearly. We totally found 5 genes, which were significantly correlated to pCR non-linearly (p<0.1). The cubic spline plots were plotted to visualize the non-linear effect of these genes (Figure 5 a-e). We compared the C-index of the variables in the training cohort with different knots numbers, and the result was presented in Supplementary table 3.
We constructed a generalized non-linear model consisted of both the linear form of CD72, PTGDS, CYTIP, CCL5, PAX5, ER status and the non-linear form of CD3G (no. of knots=3), IL2RB (no. of knots=5), TRAC (no. of knots=5), MAP7(no. of knots=5), PTPRC (no. of knots=5). The formula used in lrm function in rms R package is as following:
f=as.formula(pcr~CD72+er_status+PTGDS+CYTIP+CCL5+PAX5+rcs(MAP7,5)+rcs(IL2RB,5)+rcs(CD3G,3)+rcs(PTPRC,5)+rcs(TRAC,3)).
To find out the model with the largest predictive value, we used 5-fold cross validation to construct the final model. When we displayed the final model in the internal validation cohort, AUC was up to 0.815 and in the external cohort, the AUC was 0.797. The ROCs in the internal and external cohorts were showed in Figure 6(b, c) respectively. Variables in our model and their coefficients were described in Supplementary table 4.