Identification of two subtypes basedon TCGA data in OC
After preprocessing,the TCGA-OC dataset had 378 samples, theGSE140082 dataset had 380 samples (including 20 stage I OC tissues and 360 stage II-III-IV OC tissues), and theGSE26193 dataset had 107 samples (including 20 stage I OC tissues and 87 stagesII-III-IV OC tissues) (Table 1). We firstextracted the expression of 1038 invasion-related genesfrom the TCGA database and then used ConsensusClusterPlus to cluster these genes. At k=2, the samplescould be clustered together (Fig. 2A). The expression of 1038 invasion-related genes in the two subclasses is shownin Fig. 2B; 351 genes were highly expressed in the C1group and 687 genes expressed at low levels in the C2 group(Fig. 2C). Wefurther analyzed the prognostic relationship betweenthe two groups, and the results revealed significant differences between C1 and C2 (Fig. 2D; p < 0.05). The gene expression and prognosis of OC patients were analyzed by using KM for survival. It was suggested that the survival probability of C1 was higher than that of C2(P=0.015, Fig. 2D).DEGsbetween stages I and stages II-III-IV ovarian cancer were determined by the limma package in GSE140082 dataset. In total, 696 DEGs werefiltered according to the thresholds of false discoveryrate (FDR) < 0.05 and |log2 fold change (FC)|> 1. Therewere 341 upregulated genes and 355 downregulated genes(Fig. 2E and 2F.). Then we listed the top351 upregulated and 300 downregulated genes according to the value of |log2FC|(Table S1), which demonstrated that these genes might play vital roles in theoccurrence of OC. The ‘gplots’ and ‘ggplot2’ packages of R software were used todraw heatmap and volcanoplot of thegenes (Fig. 2B,C,E,F).
PPI network construction and module analysis of DEGs
Interactions between the identified DEGs were revealed by constructing a PPInetwork. After combiningthe results of DEGs1, DEGs2 and IRGs, 65 hub genes were determined(Fig. 2A),including MMP16,NR2F2, IFIH1, FOSB, MRAS, IDO1, HTR3A, ALDH3B2, CSF1, etc (Table S2). After combiningthe results of MCODE and CytoHubba plugins, 8 hub genes were determined,including CALB2, PEG3, MCC, FBN1, MIR200C, BTC,IRF1, NOD2 (Fig. 2B). The 8 hub genes were loaded into the STRING database, to obtain the PPI data among them,and PPIs with highest interaction score(confidence > 0.3, P < 0.05) were selected. In total,there were 8 nodes and 18 edges in the network (Fig. 2B). The red circle represented positive correlation between the two genes, and the blue trianglerepresented negative correlation between the two genes.
GO function and KEGG pathway analyses for the DEGs
The results of the GO analysis indicated that in biologicalprocess terms, the upregulated DEGs were mainly enriched in regionalization, pattern specification process, forebrain development, downregulated DEGs were significantly enriched in gamma-catenin binding, co-receptor binding (Fig. 2C, Table S3). In molecular function terms,upregulated DEGs were mainly enriched in BMP receptor binding, transmembrane receptor protein serine/threonine kinase binding, and receptor serine/threonine kinase binding, whereasdownregulated DEGs were mainly enriched in gamma-catenin binding,and co-receptor binding (Table S3).
KEGG pathway analysis demonstrated that upregulated DEGs were enriched in Pathways in cancer, TGF-beta signaling pathway, whereas downregulated DEGs were significantlyenriched in Basal cell carcinoma, Hedgehog signaling pathway, Phenylalanine metabolism (Fig. 2D, Table2).
IPA classic analysis demonstrated thatupregulated DEGs were enriched in Pathways in Role of osteoblasts in Rheumatoid Arthritis Signaling pathway, Ga12/13 Signaling, whereas downregulated DEGs were significantlyenriched in Pulmaonary fibrosis idiopathic signaling pathway (Fig. 2E).
Mutations of DE-IRGs in OCs
By analyzing somatic mutation data fromTCGA-OC patients, the differences in genomic alterations were investigated. The oncoprint map showed the top 30 genes with the highest prevalence in DE-IRGs of OCs (Fig. 3A). Missense mutations were the most common mutation type. Among the DE-IRGs, LAMA1 has the highest mutation rate, while the mutation rates of other genes are relatively low. The chromosomal locations of somatic mutations in DE-IRGs are shown in Fig. 3B. According to Fig. 3C, EYA2 and MMP16 showed extensive CNV amplification, while CDH2 and CDH5 showed CNV deletion.
Construction and evaluation of a prognostic risk model
The training set data were used for further analyses.Univariate Cox analysis was performed on 287 DEGsamong molecular subtypes, and 4 genes associated with prognosis were identified as CSF1, LPAR3, CYP4B1 and NOS1. LASSO regression was used to reducethe gene numbers in the risk model. First, we analyzedthe trajectory of each independent variable, as shown inFig. 4A. Next, we used ten-fold cross validation to construct the model and confidence interval under eachlambda, as shown in Fig. 4B, C. 4 genes were chosen from the graph when lambda was 0.00204.The final 4-gene signature formula is as follows:
Riskscore=-0.2043863*EXP(CSF1)-0.0988571*EXP(LPAR3)-0.0705539*EXP(CYP4B1)-0.1775706*EXP(NOS1).
Patients were categorized into high and low risk groups based on the optimal threshold (-5.072868). PCA principal component analysis was performed on the samples, and Fig. 4D showed the ability of the risk score to discriminate between the samples. The prognostic Kaplan–Meier (KM) curveswere performed in the high- and low-risk groups. As shown in Fig. 4E, the survival rate of patients in the low-risk group was significantly higher than that of the high-risk group. The results suggested that the more patients in the training set survived and died, the higher their risk scores were,as shown in Fig. 4E and 4F.
The ROC curve (Receiver Operating Characteristic Curve) was also used toevaluate the prognostic model, and the area under the curve(AUC) was used to predict the accuracy.The AUC of the prognostic signature is 0.778, 0.709, and 0.692 for the 1-year, 2-year, and 3-year ROC curves, respectively (Fig. 4H), suggested that the prognostic model could predict the invasion of OC.
Validation of the GEO dataset
To validate the prognostic model with the screened genes as prognostic features, we used the dataset GSE26193 to evaluate the results of the training set. The risk score was calculated using the coefficients of the above genes, and the patients were categorized into high-risk and risk groups based on the optimal threshold (-7.079729).
For GSE26193, the KM curve showed that the survival rate of patients in the low-risk group was significantly higher than that of patients in the high-risk group (P < 0.05, Fig. 5A).As shown in Fig. 5B,C,D, the more patients in the validation set survived and died, the higher their risk scores were. The AUCof prognostic signature at 1, 2, and 3 years was 0.812, 0.737, and 0.628, respectively (Fig. 5E), which showedsome predictivity.
Relationship between risk score and tumor microenvironment
The tumor microenvironment significantlyaffected the therapeutic effect and prognosis of tumor.The ssGSEA was used to assess the immunestatus of the two groups. The difference in immune cells between high and low risk groups was compared through Wilcoxon analysisand seven significant immune cells were obtained, such as Type 2 T helper cell, Plasmacytoid dendritic cell, Monocyte, Macrophage, Immature dendritic cell, Gamma delta T cell, Central memory CD4 T cell(Fig. 6A).
Additionally, we further explored the correlationsbetween the risk score and immune cell contentby spearman analysis. Theresults showed that CSF1 was positively correlated with Plasmacytoid dendritic cell and Monocyte, and negativelycorrelated with Type 2 T helper cell; CYP4B1 was positively correlated with Plasmacytoiddendritic cell, and negativelycorrelated with Monocyte and Macrophage; NOS1 and LPAR3 wasnegativelycorrelated with Monocyte and Macrophage (Fig. 6B).
Univariate and multivariate analyses of the risk score for different clinical features
To assess the independence of the risk score signature modelfor clinical application, we used Univariate and multivariate Cox regression to analyze the clinical informationand the risk score. By comparing the distribution of the risk score among clinical feature groups,significant differences were found among age, stage,and grade; (Fig. 7A, F, G, H; P< 0.05) and no significant difference was found among the grade, treatment and whether histology serous (Fig. 7A; P> 0.05). The results also showed that the high-risk groups had poorer prognosis than low-risk groups. In the TCGA dataset, multivariate Coxregression analysis showed that the risk score (HR=2.641,95% CI 1.733–4.024, p <0.001) was significantly associated with prognosis ( Fig. 7B). Heatmap showed the relationship between clinical feature and risk groupings ( Fig. 7I). The findings show that therisk score signature model had good prediction in clinical.
Construction and evaluation of the prognostic risk genes
First, the Limma package was used to compare the differences in gene expression levels between the high- and low-risk groups for the training set, with the conditions of differential gene screening: |log2FC|>0.5, P.adjust<0.05, and the volcano plots were plotted according to the results obtained, with the samples from the low-risk group as the control. Taking the samples of the low-risk group as the control, a total of 265 differentially expressed genes were obtained from the training set by differential expression analysis, of which 81 genes were up-regulated genesand 184 were down-regulated, and the heatmap and volcano map were generated in Fig. 8A and Fig. 8B.
Functional enrichment analysis of differentially expressed genes
To understand the differentially expressed genes between the high and low risk groups, GO and KEGG enrichment analysis of differentially expressed genes between the high and low risk groups was performed. As shown in Fig. 8C, 21pathways were gathered throughfunctional enrichmentanalysis with GO, such as embryonic organ morphogenesis, sensory organ morphogenesis and positive regulation of G protein-coupled receptor signaling pathway. For KEGG, six pathways were involved, such as Estrogen signaling pathway and Malignant pleural mesothelioma, et al,as shown in Fig. 8D.
Expression of the signature genes in OC tissues
To verify the accuracy of the 4-gene signature, we examined the expression of the signature genes(CSF1, CYP4B1, LPAR3 and NOS1)in clinical samples from OC patients by qPCR and IHCanalyses. The expression levels of signature genes in the training and validation sets were shown as Fig. 9Ain early OC (stage I) and Fig. 9Bin transfer OC (stage II~IV). In trainingset, theexpressions of CSF1, CYP4B1 and LPAR3were higher in stage II~IV than in stage I and NOS1 lower than in stage I. While in validation set, theexpressions of CSF1, CYP4B1LPAR3 and NOS1 were the same trends as the trainingset, but the difference was not statistically significant. The IHC showed the same trends of the four signature genes expression in both data sets (Fig. 9C).