Acquisition and Grouping of Raw Expression Profiling Data
The flow diagram of the study is presented in Fig. 1. We first searched the GEO database for datasets that compared pancreatic cancer with normal tissue and contained as many samples as possible. Three datasets were chosen: GSE15471, GSE16515, and GSE32676. GSE15471 contains 39 normal samples and 39 tumor samples, and GSE16515 contains 16 normal samples and 36 tumor samples. These two datasets were combined as the train group (55 normal and 75 tumor samples), and GSE32676, which contains 7 normal samples and 25 tumor samples, was used as the test group.
Identification of DEGs
The DEGs were screened by comparing the train group with the test group. Using the thresholds |logFC| ≥ 1 and adjusted p value < 0.05, 55 DEGs were identified, of which 39 were up-regulated and 16 were down-regulated in the test group. These DEGs are shown in a heatmap and a volcano plot (Fig. 2A, B).
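The two-threshold screening rule can be illustrated with a minimal sketch. The gene names and statistics below are hypothetical placeholders, not values from the study, and the `GeneStat` type and `classify_deg` function are illustrative names. The source does not say which software performed the screening.

```python
from typing import NamedTuple, Optional


class GeneStat(NamedTuple):
    gene: str
    log_fc: float  # logFC between the compared groups
    adj_p: float   # multiple-testing-adjusted p value


def classify_deg(stat: GeneStat, fc_cutoff: float = 1.0,
                 p_cutoff: float = 0.05) -> Optional[str]:
    """Return 'up', 'down', or None according to the two thresholds."""
    if stat.adj_p >= p_cutoff:
        return None  # not significant after adjustment
    if stat.log_fc >= fc_cutoff:
        return "up"
    if stat.log_fc <= -fc_cutoff:
        return "down"
    return None  # significant, but the fold change is too small


# Hypothetical illustrative values, not results from the study.
stats = [
    GeneStat("GENE_A", 2.3, 0.001),
    GeneStat("GENE_B", -1.4, 0.020),
    GeneStat("GENE_C", 0.6, 0.001),  # fold change too small
    GeneStat("GENE_D", 1.8, 0.200),  # not significant
]

labels = {s.gene: classify_deg(s) for s in stats}
up = [g for g, label in labels.items() if label == "up"]
down = [g for g, label in labels.items() if label == "down"]
print("up-regulated:", up)
print("down-regulated:", down)
```

A gene is kept only if it passes both thresholds, so a large fold change with a non-significant adjusted p value (GENE_D) is excluded just like a significant but small change (GENE_C).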
Visual Enrichment Analysis of DEGs
First, Metascape analysis was carried out to identify the pathways and functions in which the DEGs were enriched, and the results are displayed as a bar diagram (Fig. 3A) and a network diagram (Fig. 3B). GO enrichment analysis then showed that the DEGs were mainly enriched in epidermis development (BP), endoplasmic reticulum lumen (CC), and sulfur compound binding (MF) (Fig. 4A, B). KEGG pathway analysis identified pancreatic secretion and complement and coagulation cascades as important enrichment pathways for the DEGs (Fig. 4C, D). Finally, we constructed a PPI network diagram to explore the potential features of these DEGs (Fig. 5).
Screening and Verifying the Feature Genes of Pancreatic Cancer
The random tree diagram showed the errors of the control group, the treat group, and all samples (Fig. 6A). We identified the genes represented by the points with the smallest cross-validation error and scored these genes; the higher the score, the more important the gene. Ten genes with an importance score > 2 were selected, namely FGD6, ANO1, POSTN, AHNAK2, FN1, SLC39A5, RHBDL2, MTMR11, SQLE, and ADAM9 (Fig. 6B). The heatmap shows the differential expression of the 10 feature genes between the two groups (Fig. 6C).
Neural Network Model Construction and Identification
According to the gene scores and weights, a neural network model was constructed to identify sample attributes (Fig. 7A). The input layer consisted of the 10 genes with scores > 2. The model correctly predicted 52 of 55 samples in the control group and 73 of 75 samples in the treat group. ROC curves were then drawn to assess the accuracy of the model in predicting sample attributes. The area under the ROC curve for the train group was 0.990 (95% CI: 0.976–1.000) (Fig. 7B), indicating that the neural network model is highly accurate on the training data. External verification gave an area under the curve of 0.869 (95% CI: 0.720–0.983) for the test group (Fig. 7C), indicating that the model retains good accuracy on independent data.
Fig. 7 Construction of the neural network model. (A) The neural network model built to predict genetic properties, consisting of an input layer, a hidden layer, and an output layer. (B) ROC curve assessing the accuracy of the neural network model in the train group; the AUC was 0.990 (95% CI: 0.976–1.000). (C) ROC curve assessing the accuracy of the neural network model in the test group; the AUC was 0.869 (95% CI: 0.720–0.983).
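The classification counts reported for the train group can be turned into the usual summary metrics, and an AUC can be computed from model scores with the rank-based (Mann-Whitney) formulation. The sketch below assumes the treat group is the positive class. The function names and the example scores are hypothetical; only the counts 52/55 and 73/75 come from the text. It does not reproduce the reported AUC values, which require the per-sample model outputs.

```python
def confusion_metrics(tn, n_control, tp, n_treat):
    """Specificity, sensitivity, and accuracy, with treat as the positive class."""
    specificity = tn / n_control
    sensitivity = tp / n_treat
    accuracy = (tn + tp) / (n_control + n_treat)
    return specificity, sensitivity, accuracy


def auc_rank(scores_pos, scores_neg):
    """AUC as the probability that a positive sample outscores a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Counts reported for the train group: 52/55 control, 73/75 treat.
spec, sens, acc = confusion_metrics(tn=52, n_control=55, tp=73, n_treat=75)
print(f"specificity={spec:.3f} sensitivity={sens:.3f} accuracy={acc:.3f}")

# Hypothetical model scores, used only to exercise auc_rank.
example_auc = auc_rank([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(f"example AUC={example_auc:.3f}")
```

With the reported counts this gives a specificity of about 0.945, a sensitivity of about 0.973, and an overall training accuracy of about 0.962.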
Distribution of Infiltrating Immune Cells
Using the CIBERSORT algorithm, we estimated the scores of 22 types of immune cells in each sample to evaluate the state of immune infiltration (Fig. 8A). The results showed that the scores of memory B cells and gamma delta T cells declined significantly in the treat group, whereas the score of neutrophils increased significantly (Fig. 8B). Finally, we drew a correlation heatmap to show the correlations between immune cell types (Fig. 8C).
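A correlation heatmap of this kind is built from a pairwise correlation matrix over the per-sample immune cell scores. The sketch below assumes Pearson correlation, which the source does not specify. The score values are hypothetical placeholders rather than CIBERSORT output, and only three of the 22 cell types are shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-sample scores (one value per sample), for illustration only.
scores = {
    "B cells memory": [0.10, 0.08, 0.02, 0.01],
    "T cells gamma delta": [0.05, 0.03, 0.02, 0.01],
    "Neutrophils": [0.01, 0.02, 0.06, 0.08],
}

cell_types = list(scores)
corr = {
    a: {b: pearson(scores[a], scores[b]) for b in cell_types}
    for a in cell_types
}

# Print the matrix that a heatmap would colour-code.
for a in cell_types:
    print(a.ljust(20), " ".join(f"{corr[a][b]:+.2f}" for b in cell_types))
```

Each cell of the heatmap corresponds to one entry of `corr`; the matrix is symmetric with ones on the diagonal.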
The Relationship between Feature Genes and Prognosis
To determine whether the feature genes in the model are closely related to the prognosis of pancreatic cancer, we obtained the gene expression profiles and corresponding survival information of 178 pancreatic cancer patient samples from TCGA and performed OS, PFS, and ROC analyses. Only three feature genes (ANO1, AHNAK2, and ADAM9) were significantly associated with prognosis in all three analyses (p < 0.05) (Fig. 9 and Supplementary Figs. 1, 2, and 3), suggesting that these three feature genes may act as molecular markers for predicting the prognosis of pancreatic cancer patients.
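Survival comparisons such as OS and PFS are commonly summarised with Kaplan-Meier curves. The source does not state which estimator was used, so the following is only a sketch under that assumption. The follow-up times and event flags are hypothetical, and `kaplan_meier` is an illustrative function name.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # events and censored patients both leave the risk set
    return curve


# Hypothetical follow-up data (months), for illustration only.
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.2f}")
```

In practice, patients would be split into high- and low-expression groups for each feature gene, one curve would be estimated per group, and the curves would be compared with a significance test.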
Immunohistochemical Staining Images Validation
The above analysis confirmed that all three feature genes were expressed at higher levels in cancer tissue than in normal tissue at the transcriptome level. To determine whether these genes are also expressed as proteins in PDAC, we examined their expression in HPA, which provides IHC staining images. No images were available in HPA for ADAM9, but the immunohistochemical images of ANO1 and AHNAK2 showed high protein expression in cancer tissue, especially for AHNAK2 (Fig. 10). Although there were no ADAM9 images, previous studies have confirmed that ADAM9 protein is highly expressed in pancreatic cancer samples and promotes the development of pancreatic cancer [13, 14].