Data Preparation. Gene expression data for many diseases can be publicly accessed in the GEO database (www.ncbi.nlm.nih.gov/geo). The GSE66359 dataset was generated using the Affymetrix Human Genome U133 Plus 2.0 Array microarray platform and comprises eight cSCC samples and five normal skin samples. The GSE117247 dataset was generated using the Affymetrix Human Genome U95 Version 2 Array platform and includes eight cSCC samples and ten normal skin samples. The GSE45164 dataset was generated using the Affymetrix Human Genome U133A 2.0 Array platform and consists of ten cSCC samples and three normal skin samples. We merged the samples from the GSE66359 and GSE117247 datasets and corrected batch effects using the Combat package in R software (version 4.1.2)11. The merged dataset was designated as GSE66359-117247, comprising sixteen cSCC samples and fifteen normal samples. GSE66359, GSE117247, GSE45164, and GSE66359-117247 will be utilized for subsequent analyses.
Data Preprocessing and DEGs Selection. Quality control and preprocessing of the raw files (*.CEL) obtained from the GEO database were conducted using the AffyPLM package for data quality assessment12. Relative log expression (RLE) plots were employed to evaluate data trend consistency. Normalization was performed using the Robust Multi-Array Average (RMA) algorithm based on R software (version 4.1.2)13, and the results were validated using principal components analysis (PCA). Probes were converted to gene symbols using the hgu133plus2.db, hgu95av2.db, and hgu133a2.db packages. In cases where one gene symbol corresponded to multiple probes, the probe with the median expression value was selected. DEGs between samples were conducted using the limma package14, with criteria set at |log2FC| > 2 and p < 0.05 for gene selection. Volcano plots and heatmaps were generated using the heatmap package to visualize the expression patterns of differentially expressed genes.
Enrichment Analysis. To perform functional-level analysis of DEGs, we conducted enrichment analysis for GO15 and KEGG16 pathways using the clusterProfiler package17 in R software. GO analysis encompasses biological processes, cellular components, and molecular functions. A significance threshold of p < 0.05 was considered statistically meaningful.
Construction of PPI Network. We utilized the STRING online database (https://string-db.org)18 to construct the PPI network, setting a threshold interaction score of > 0.4. Subsequently, we employed Cytoscape software (version 3.9.1) for network visualization19. We utilized the MCODE plugin to extract densely connected modules from the PPI network, with filtering criteria set as degree cutoff = 2, node score cutoff = 0.2, K-core = 2, and maximum depth = 100. The CytoHubba plugin was employed to identify key genes within the PPI network. We extracted the top 30 genes ranked by five algorithms (Degree, MCC, MNC, Closeness, and Stress). Genes that appeared at the intersection of all five methods and were present in the top-ranked module were considered potential key genes.
Construction of WGCNA and Identification of Hub Genes. WGCNA is a practical algorithm for discovering highly correlated gene modules and identifying phenotypes-associated modules20. We employed the WGCNA package to construct a weighted gene co-expression network. The gene expression matrix, GSE66359-117247, obtained after data integration and batch effect correction, was used as input for WGCNA. Initially, we conducted hierarchical clustering of samples to remove outliers and ensure the accuracy of the analysis. To establish a scale-free network, we determined the soft threshold, β, using the pickSoftThreshold function. Subsequently, we constructed the weighted network, created a hierarchical clustering dendrogram, and delineated modules using the blockwiseModules function. To further identify functional modules within the co-expression network, module-trait associations were calculated, and modules with high correlation coefficients were considered cSCC-related modules for subsequent analysis. For each gene within a module, the membership (MM) reflects its correlation with module eigengenes, while the gene significance (GS) represents the correlation of the gene with corresponding clinical traits. Genes with MM > 0.8 and GS > 0.6 were considered candidate key genes, and those overlapping with key genes from the PPI analysis were designated as the final hub genes.
Validation of Receiver Operating Characteristic (ROC) Curves. GSE66359 + 117247 were utilized as internal datasets, while GSE45164 was employed as an external dataset to evaluate the diagnostic efficacy of the hub genes through ROC curve analysis. The Area Under the Curve (AUC), representing the area under the ROC curve, serves as a metric for assessing the performance of various predictive models.
Samples Collection. We selected 30 patients diagnosed with cSCC who received treatment at our institution between April 2017 and April 2022. Surgical excision was performed to obtain corresponding tumor tissues and adjacent normal tissues. All patients underwent routine examinations, including complete blood counts, and had no history of other malignancies. Histopathological diagnosis confirmed the presence of cSCC in all cases.
Immunohistochemistry. The cSCC and adjacent normal tissues were obtained through surgical excision, followed by paraffin embedding and sectioning for immunohistochemical staining. The tissue sections were incubated in 0.1% BSA, followed by overnight incubation at 4°C with antibodies against CCNA2 (Proteintech, 66087-1-Ig), CCNB2 (Proteintech, 66391-1-Ig), and UBE2C (Proteintech, 66726-1-Ig). Biotinylated IgG was used as the secondary antibody. Color development was achieved using the DAB staining kit following the secondary antibody reaction, and the sections were observed and captured using an optical microscope. Finally, analysis was conducted using ImageJ software.
Construction of the ceRNA Networks. We employed five online miRNA databases, RNA22, DIANA-micro T, miRWalk, miRDB, and miRcode, to predict the target miRNAs for the hub genes. Subsequently, we selected miRNAs that were identified in at least four of these databases for further analysis. StarBase (version 3.0) (http://starbase.sysu.edu.cn/index.php) was used to predict the interactions of selected miRNAs with lncRNAs21. We selected lncRNAs that were present in the majority of miRNA prediction results for further analysis. Cytoscape was utilized to construct interaction networks, namely the ceRNA networks, based on interactions among mRNAs, miRNAs, and lncRNAs.
Data Statistics. Statistical analysis of experimental data was conducted using R software (version 4.1.2) and Prism software (version 9.3.1). Student’s t-test was used to compare the differences between the two groups. A paired t-test was employed to validate significant differences in Immunohistochemistry. *** indicated P < 0.001.