At the first step, six PBMCs gene expression of human microarray datasets associated with SLE, which originated from three platforms, were downloaded and each paired data related to the same platform integrated by cross-platform normalization algorithm. Subsequently, SLE, TCR and BCR signaling pathways were employed as structures of BN and trained by three major datasets individually by BNrich method. Consequently, the significant parameters merged by meta-analysis to achieve the final criteria and detect alterations in considering pathways. The workflow was shown in Figure 1.
Figure 1. Workflow and analysis procedure; from row SLE and healthy control (HC) data in various datasets to identify signaling pathways alterations in SLE patients.
2.1. Gene expression datasets
The PBMCs gene expression of human microarray datasets that contain both SLE patients and healthy control subjects (published or updated in 2010-2019) downloaded from Gene Expression Omnibus (GEO) database: GSE17755 [25], GSE 12374 [26], GSE 50772 [27], GSE 81622, GSE 121239 [28,29] and GSE 126307 [30]. We described the details of the data in Table 1.
Table 1. Description of the datasets used in the study.
NO
|
GSE-NO
|
GPL/Platform
|
No of sample
|
Cell-type
|
Update(year)
|
Race
|
SLE patients
|
control
|
1
|
17755
|
1291/ Hitachisoft
|
22
|
55
|
PBMC
|
2010
|
Japanese
|
2
|
12374
|
1291/ Hitachisoft
|
11
|
6
|
PBMC
|
2012
|
Japanese
|
3
|
50772
|
570/Affymetrix
|
61
|
20
|
PBMC
|
2015
|
Unknown1
|
4
|
81622
|
10558/Illumina
|
30
|
25
|
PBMC
|
2016
|
Unknown2
|
5
|
121239
|
13158/Affymetrix
|
65
|
20
|
PBMC
|
2018
|
Unknown3
|
6
|
126307
|
13369/Illumina
|
31
|
9
|
PBMC
|
2019
|
Several races4
|
1 The data collected in the USA/South San Francisco, but the race of subjects is unknown.
2 The data collected in the USA/Dallas, but the race of subjects is unknown.
3 The data collected in Spain / Granada, but the race of subjects is unknown.
4 The data collected in USA/Dallas, but the race of subjects is Australian, Australian (Irish / Scottish descent), born in India-ethnicity unknown, Caucasian, Caucasian/Japanese, English, Filipino, Indian, Iraqi, Latin American, Persian, Spanish, White Australian / Anglo-Celtic.
2.2. Cross-platform normalization
Since the gene expression datasets of PBMCs were related to different platforms, we merged the data through cross-platform normalization algorithms [23] to (1) increase sample sizes and improve genes signature selection [23,31–34]; (2) appraise the heterogeneity of the overall estimate, and (3) decrease the effects of individual study-specific biases [23,34]. Finally, we integrated all datasets related to the same platform and built three major datasets from Hitachisoft, Affymetrix and Illumina.
All procedures for data processing and integration were carried out using the R statistical programming language. Each expression datasets were preprocessed by normalizeQuantiles function from R package limma and then all datasets derived from the same platform were integrated with Reduce function. Afterwards, the empirical Bayes method (ComBat) from the R package sva was used for batch effect removal [35].
2.3. The signalling pathways as prior information
The SLE signaling pathway (hsa:05322) and two most important pathways in SLE pathogenesis, TCR (hsa:04660) and BCR (hsa:04662) signaling pathways were implemented in this study. All of the pathways were extracted directly from the KEGG database (Release 90.0, April 1, 2019) [36].
2.4. Bayesian network approach
The three desired pathways were exploited as structures of BNs [21]. In the parameter estimate step, the mean value of gene expression for each node can be modelled as a linear regression of its parents’ gene expression because the gene expression data is continuous [22,24]. Thus, when gene has parents in the network and and describe the estimations of for the SLE patient and healthy control datasets, respectively; they can be modelled as follows: (see Equations in the Supplementary Files)
The and are the residual values. Estimated parameters of BNs were derived from the following distribution [37]: (see Equations in the Supplementary Files)
The parameters of the structures were estimated by SLEs (patients) and healthy controls of the three major datasets individually. It means for each parameter six values were available; SLE vs controls in three datasets. We compared corresponding β parameters between each pair of trained networks (patient and healthy states) by independent t-test [38] and gained their p-value or false discovery rate (FDR). Consequently, for every parameter related to a node or an edge, for example , after independent t-test between patient and healthy states, we had FDRA (Affymetrix), FDRI (Illumina) and FDRH (Hitachisoft).
2.5. Meta-Analysis
By using the meta-analysis approach [23], we incorporated the results of platforms rather than continue to integrate different platforms data. At the first step, the three major datasets were assessed by BNrich method individually. The FDR was also acquired from independent t-test between all paired parameters associated with the SLE patient ( ) and healthy control ( ).
Subsequently, we defined the parameter similar to the expression fold-change, but for demonstrating the variations of BNrich parameters, as below: (see Equations in the Supplementary Files)
which refers to the the corresponding parameter in the SLE and healthy control.
Finally, we integrated the acquired significant parameters of BNrich (FDR < 0.05 & > 1) in the above-mentioned pathways by the following meta-analysis method to compute the final criteria: (see Equations in the Supplementary Files)
where , and are in Affymetrix, Illumina and Hitachisoft datasets based on parameter. Furthermore, , = 95 and were the number of subjects in Affymetrix, Illumina and Hitachisoft datasets, respectively. If the parameter in each dataset was not significant, the weight of the parameter in that dataset was zero. When the was associated with the node parameters, the sign of indicated up-regulation (+) or down-regulation (-) of the gene and the absolute value of was used to measure the value of activation or inhibition. Where the was associated with the edge parameters, the sign of indicated increasing (+) or decreasing (-) the biological function and the absolute value of was used to measure the value of these increasing and decreasing functions.