Proposed method for identification of DEGs based on gene clustering
In this study, we devised a new analysis method that uses the functionalities of the MBCluster.Seq package. Therefore, for consistency and clarity, we used notations that were similar to those described in the original paper [28]. First, we outlined a method to describe the input/output relationship. We denote an input count matrix as one where each row indicates a gene g (= 1, …, G), each column a replicate j (= 1, …, ni) of group i (= 1, …, I), and each cell the number of counts. Here, G is the number of genes, I is the number of compared groups, and ni is the number of replicates for the group i. MBCluster.Seq clusters gene vectors βg = (βg1, …, βgI), where βgi indicates the count of gene g in the group i relative to the overall mean on a log-scale. Therefore, the summation of βgi for a gene g across all compared groups was 0. βg can be described as the log fold-change (FC) between compared groups. Given a preselected number of clusters K, MBCluster.Seq gives two results as an output. One is the center for cluster k, µk = (µk1, …, µkI) for k = 1, …, K, and the other is the posterior probability (PP) that gene g belongs to the kth cluster, pg = (pg1, …, pgK) for g = 1, …, G.
A ranked gene list for a cluster mainly consisting of non-DEGs can be obtained via MBCluster.Seq by using the assigned PPs. To identify the non-DEG cluster, we consider the L2 Norm of µk for each cluster center across groups, ||µk||2 = (|µk1|2 + … + |µkI|2)1/2. A smaller value of the norm for cluster k indicates a smaller degree of DE across groups for the cluster. Accordingly, we can regard the probability of a gene being located in the kth column as that in the non-DEG cluster, where k = argmin(||µ1||2, …, ||µK||2). A lower value of the PP for gene g in the kth cluster (i.e., pgk) indicates a higher degree of DE between the compared groups. For simplicity, we call the proposed method based on MBCluster.Seq as “MBCdeg”.
MBCluster.Seq employs a scaling normalization algorithm [33] as the default, where the upper quartile (i.e., 75th percentile) counts for individual samples are used as the scaling factors to obtain equal values for upper quartile counts across all samples after the normalization. This type of conventional normalization algorithm assumes that the proportion of DEGs (PDEG) in the data is small (e.g., less than 5%) or that the proportion of genes upregulated in individual groups in the DEGs is approximately balanced (i.e., P1 ≈ P2 when comparing groups 1 and 2; P1 + P2 = 1) [34–36]. However, in practice, these assumptions are invalid in certain cases, such as when PDEG ≈ 60% [37] and P1 > > P2 [38]. To investigate the effect of various normalization algorithms on the data analysis, we employed a competing method (called DEGES) [7, 14]. The main arguments in its favor are (1) DEGES was originally designed to manipulate such scenarios (~ 25% of PDEG with P1 > > P2), (2) it can be applied to ~ 60% of PDEG [35], and (3) the output (i.e., normalization factors) can easily be applied to the framework of MBCluster.SEq. Henceforth, we refer to MBCdeg using the default normalization algorithm and with the DEGES normalization algorithm as “MBCdeg1” and “MBCdeg2”, respectively.
Analysis of simulated data for a two-group comparison
We show an example of MBCdeg based analysis for simulated data for two groups: G = 10,000, n1 = n2 = 3, PDEG = 0.25, P1 = 0.9 (or P2 = 0.1), and FC = 4. We performed a run for MBCdeg with K = 3, assuming three expression patterns: up-regulated in group 1 (“DEG1” pattern), up-regulated in group 2 (“DEG2”), and consistent in both groups (“non-DEG”). Using this approach, we ideally obtain the centers for the cluster k (= 1, 2, 3) as µ1 = (0.69, -0.69), µ2 = (-0.69, 0.69), and µ3 = (0.00, 0.00), with loge(2) = 0.69. Clearly, the third cluster has the smallest L2 Norm (i.e., ||µ3||2 = 0). We therefore regard this cluster as containing many non-DEGs and used the PPs assigned to this cluster for estimating the overall degree of DE.
Here, MBCdeg1 gives the output µ1 = (0.56, -0.56), µ2 = (-0.14, 0.14), and µ3 = (-0.81, 0.81) and uses the PPs of the second cluster displaying the smallest norm (||µ2||2 = 0.195) for gene ranking. Meanwhile, MBCdeg2 outputs µ1 = (0.68, -0.68), µ2 = (-0.69, 0.69), and µ3 = (-0.02, 0.02) and uses the PPs of the third cluster displaying the smallest norm (||µ3||2 = 0.027) for gene ranking. The expression patterns for the cluster centers from MBCdeg2 were closer to the ideal patterns than those from MBCdeg1. This result suggests that the DEGES normalization may be useful in the framework for MBCdeg, at least in similar scenarios.
Next, we evaluated the performance of the overall gene ranking using the area under the receiver operating characteristic (ROC) curve (i.e., AUC). The AUC enables data comparisons without a tradeoff in sensitivity and specificity because the ROC curve is created by plotting the true positive rate (i.e., sensitivity) against the false positive rate (1 – specificity) obtained for each possible threshold value [39–41]. Therefore, an accurate method should provide high AUC values. We compared five methods: two clustering-based methods (MBCdeg1 and MBCdeg2) and three conventional methods (edgeR [11], DESeq2 [15], and TCC [14]). The edgeR and DESeq2 packages are widely used [25]. TCC is used as the main alternative to MBCdeg as it utilizes the DEGES normalization algorithm.
Figure 1 shows the AUC values for the five methods using various simulations: PDEG = 0.05 and 0.25, with P1 = 0.5, 0.7, 0.9, and 1.0. A higher P1 value indicates a higher degree of upregulated DEGs in group 1, ranging from symmetric (P1 = 0.5) to completely asymmetric (P1 = 1.0) conditions. The AUC values for the two MBCdeg methods were higher than those from the conventional DE methods. TCC showed the best among the three conventional methods because the simulated data was generated using this package and the DEGES normalization algorithm in TCC was originally designed to manipulate such asymmetric scenarios (i.e., P1 > 0.5). Therefore, the high performance of MBCdeg over TCC is an interesting result.
The performance of MBCdeg does not appear to depend on whether the simulation scenario is symmetric or asymmetric. Most of the trials of MBCdeg showed AUC values of 0.93, and the range of P1 from 0.5 to 0.9, suggesting that the framework of MBCdeg is intrinsically robust for both symmetric and asymmetric data. As indicated above, we confirmed that MBCdeg2 tends to have lower norm values for non-DEG clusters (i.e., ||µk||2 with k = argmin(||µ1||2, …, ||µK||2)) than MBCdeg1 in scenarios with P1 > 0.5. This can be explained by the use of DEGES in the MBCdeg2 algorithm.
The overall performance of MBCdeg2 was very similar to that of MBCdeg1 although the former performs better than the latter in completely asymmetric scenarios (i.e., P1 = 1.0). We hypothesized that the key to accurate gene ranking may be to identify non-DEG cluster, and not to make the ideal non-DE pattern. This trend was also observed when the number of replicates per group was increased to n1 = n2 = 6, 9, and 12 (Additional file 1). In a comparison of the performance of each method using different numbers of replicates (Fig. 1 and Additional file 1), we observed that AUC values tend to increase as the number of replicates increases and this trend is consistent with a previous report [20].
Effect on different numbers of clusters
We noted that MBCdeg tends to increase the variation in AUC between trials in association with an increase in P1 values, and this effect was greatest at P1 = 1.0. For example, the AUC values for MBCdeg1 and MBCdeg2 were less than 0.8 for ten and three trials, respectively, when PDEG = 0.05 (Fig. 1). The relatively poor performance of MBCdeg at P1 = 1.0 than with other values of P1 can be explained by the difference between the true number of clusters (i.e., Ktruth = 2) in this condition and the number of clusters that were used (K = 3). The simulated data obtained with P1 = 1.0 has two expression patterns; “DEG1” and “non-DEG”, and does not have the “DEG2” pattern.
To demonstrate the effect of using MBCdeg on a predefined number of clusters, we investigated the AUC values for MBCdeg with K = 2–4 (Fig. 2). Our results showed that the AUC values were the highest when K = Ktruth. The highest AUC values for P1 = 0.5, 0.7, 0.9, and 1.0 were obtained using K = 3, 3, 3, and 2, respectively (indicated by the downward arrow in light blue). This result suggests that the use of an accurate number of clusters is important when performing an analysis using MBCdeg. The distributions of the AUC values indicate the K values that should be used in practice. At P1 = 0.5 and 0.7 (i.e., Ktruth = 3), the results for K > Ktruth (i.e., K = 4) were more accurate than those with K < Ktruth (i.e., K = 2). Further, this trend was observed when the number of replicates per group was increased to n1 = n2 = 6, 9, and 12 (Additional file 2).
In practice, it may be safe to adopt a larger K value. For two-group comparisons, the user would only need to run MBCdeg with K ≥ 3. Although the frequency of trials that result in a low performance of gene ranking is relatively high at larger P1 values, the probability of obtaining extremely asymmetric results such as P1 = 1.0 is low, and may at most indicate P1 = 0.9. Therefore, both methods (MBCdeg1 and MBCdeg2) may be useful in scenarios that are similar to the conditions investigated.
Although the probability of extremely low performance at P1 ≤ 0.9 is at most 2%, we should discuss the reasons. For example, the use of MBCdeg2 with K = 3 outputs one such result (AUC = 0.634) with PDEG = 0.25 and P1 = 0.5. As this simulated data consists of 1,250 DEG1, 1,250 DEG2, and 7,500 non-DEG patterns, almost all trials with high AUC values (> 0.92) gave three clusters, each of which consisted of many genes that resembled one of the three patterns, and having one predominant pattern (Table 1a). However, for the trial with low AUC value (0.634), the numbers of genes in the three clusters were 1,299, 8,670, and 31, respectively (Table 1b). Of these, the first cluster consisted of the most genes showing the DEG2 pattern, i.e., 12 DEG1, 935 DEG2, and 352 non-DEG patterns. The second cluster consisted of a great majority of genes with non-DEG patterns, i.e., 1,238 DEG1, 289 DEG2, and 7,143 non-DEG patterns. As this non-DEG cluster contains 1,238/1,250 = 99.0% of the genes with a true DEG1 pattern, these ranking inaccuracies may be responsible for the low AUC values obtained. As per our analysis, most low AUC values result due to the incorporation of DEGs (i.e., the DEG1 or DEG2 patterns) into the non-DEG clusters.
The third cluster shown in Table 1b had very few genes (i.e., 26 DEG2 and 5 non-DEGs). These results indicate that the first two clusters determine the performance of the trial. The results from using K = 2 with the same simulation conditions (i.e., PDEG = 0.25 and P1 = 0.5 in Fig. 2) indicate that the distribution of AUC values located around 0.63 may thus be reasonable. Similar inferences can be made for other results, for example, with PDEG = 0.25 and P1 = 0.7. As the simulated data consists of 1,750 DEG1, 750 DEG2, and 7,500 non-DEG patterns, MBCdeg with K = 2 tends to output one cluster that is mainly composed of non-DEG and DEG2 genes, with the second cluster mainly composed of DEG1 genes. The median AUC values (≈ 0.77) when using MBCdeg with K = 2 at P1 = 0.7 are higher than those (≈ 0.63) at P1 = 0.5, and this can be explained by the smaller number of true DEG patterns (750 < 1,250) contained in the non-DEG cluster. As this phenomenon was observed using both MBCdeg1 and MBCdeg2, it does not result due to the differences in the normalization algorithms. MBCdeg determines clusters stochastically, and therefore an output of results that appear as outliers will be obtained with a certain probability.
Effect on different degrees of DE
For the simulation analysis described above, the degree of DE was fixed at 4-fold (i.e., FC = 4), which limits the expression pattern of genes upregulated in a group to one, which is favorable when using MBCdeg. However, in practice, the degree of DE would differ for genes. Therefore, we performed simulations using different degrees of DE [7], where the FCs for DEGs were randomly sampled from "1.2 + a gamma distribution with shape = 2.0 and scale = 0.5." Accordingly, the minimum and mean fold-change were approximately 1.2 and 2.2 (= 1.2 + 2.0 × 0.5), respectively.
The results of the analysis indicated that greater variability in the performance of MBCdeg is observed compared to the results obtained using fixed FC, as shown in Fig. 1 (Additional file 3). AUC values are lower overall than those from the fixed FC comparison (Fig. 1) in the same condition, and this can be explained by the lower degree of DE (4-fold → 2.2-fold). Though MBCdeg gave superior results than other packages (edgeR, DESeq2, and TCC) with PDEG = 0.25, it did not perform as well using PDEG = 0.05. This may occur due to the small number of DEGs (10,000 × 0.05 = 500 genes) under the conditions used for analysis and the resulting instability of clusters that mainly consisted of DEG1 or DEG2 patterns. As the total number of DEGs is relatively large with PDEG = 0.25, clusters containing more DEG1 or DEG2 patterns are more likely to result and are probably more stable.
Notably, the AUC values for MBCdeg are not always the highest when K = Ktruth (Additional file 4). With PDEG = 0.25 and P1 = 0.9, the AUC values tend to increase as the number of clusters K increases. This may occur because the simulated data contains more expression patterns than Ktruth. In the conditions used here, Ktruth = 3 was used for convenience. However, genes upregulated in a group may show various degrees of DE, such as 2.1-fold and 2.7-fold, and allowing the formation of separate clusters would have reduced the possibility of the inclusion of these genes in the non-DEG cluster.
Effect on larger PDEG values
The results described above were obtained with PDEG ≤ 0.25 and K = 3, and it may be reasonable to expect the largest cluster to consist of many non-DEGs for PDEG < 0.5. However, there may be instances where PDEG > 0.5 [37, 42], and we investigated the effect of P1 = 0.5, 0.7, 0.9, and 1.0 on larger PDEG values (= 0.45–0.75) (Fig. 3). Our results showed that MBCdeg gave high AUC values in most trials with PDEG ≤ 0.55 and P1 ≤ 0.9. However, its performance was decreased drastically in other conditions (e.g., P1 ≥ 0.9 and PDEG ≥ 0.65).
Overall, the performance of MBCdeg was similar to TCC, and there were few differences between the performance of MBCdeg1 and MBCdeg2. However, MBCdeg1 is superior to both TCC and MBCdeg2 for PDEG = 0.75 and P1 = 0.7. This may be explained by the failure of DEGES normalization that was used in TCC and MBCdeg2. The DEGES normalization works well in the conditions where PDEG ≤ 0.45 and P1 ≤ 1.0, PDEG ≤ 0.55 and P1 ≤ 0.9, PDEG ≤ 0.65 and P1 ≤ 0.7, and PDEG ≤ 0.75 and P1 ≤ 0.5. However, the performance of the normalization decreases in other conditions, and these results are similar to those reported by Evans et al. [35].
Table 2 shows a representative result for MBCdeg for PDEG = 0.75, and P1 = 0.7. In this condition, 5,250 DEG1, 2,250 DEG2, and 2,500 non-DEG patterns were observed. MBCdeg1 gave three clusters, each of which consisted of many genes that resembled one of the three patterns, with a particular pattern predominating. In the trial shown in Table 2a, MBCdeg1 successfully identified the third cluster displaying the minimum norm value (= 0.3145) as the non-DEG cluster and produced an accurately ranked gene list with a high AUC value (= 0.9329). However, in the same trial shown in Table 2b, MBCdeg2 incorrectly determined the third cluster displaying the minimum norm value (= 0.0762) as the non-DEG cluster and produced an inaccurately ranked gene list with a low AUC value (= 0.2421). As the third cluster mainly consists of genes with the DEG1 pattern (4,942/5,637 = 87.7%), genes that belong to the first and second clusters will be located at the top of the ranked gene list. The AUC value (= 0.2421) may result due to genes in the first cluster showing predominantly DEG2 patterns (1,979/2,436 = 81.2%) that are correctly ranked at the top.
In the scenario where MBCdeg1 is inferior to MBCdeg2, we observed 22% of trials (= 11/50) with extremely low AUC values (< 0.2) for MBCdeg1 with PDEG = 0.45 and P1 = 1.0. As this condition has 4,500 DEG1 and 5,500 non-DEG patterns, the accurate number of clusters is two (i.e., Ktruth = 2). Table 3 shows the representative case of an MBCdeg trial with K = 3. We observed that MBCdeg1 incorrectly determined the third cluster that displayed the minimum norm value (0.4100) as non-DEG one and produced an inaccurately ranked gene list with an extremely low AUC value (= 0.1195). As this condition does not contain any DEG2 pattern, the failure to identify non-DEG cluster results in a non-ideal and incorrectly ranked gene list. This produces several extremely inferior results (11/50 = 22% of trials) when using MBCdeg1. A greater number of successful trials using MBCdeg1 (39/50 = 78% of trials) with PDEG = 0.45 and P1 = 1.0 may result due to the number of non-DEG patterns being greater than DEG1 patterns (i.e., 55:45). Similar explanations for the fewer successful cases using MBCdeg with PDEG = 0.55 and P1 = 1.0 may be valid as the number of non-DEG patterns is fewer than the number of DEG1 patterns (i.e., 45:55). MBCdeg2 shows stable and relatively accurate performance for PDEG ≤ 0.45, regardless of the value of P1 and it may be useful in these conditions. This trend was also observed when the number of replicates per group was increased to n1 = n2 = 6, 9, and 12 (Additional file 5).
Three-group simulated data
Next, we investigated the performance of MBCdeg in a multi-group comparison. Tang et al. [43] evaluated the performance of 12 DE pipelines available across nine R packages (TCC [14], edgeR [11], DESeq [44], DESeq2 [15], limma [45], samr [46], PoissonSeq [47], baySeq [12], and EBSeq [13]) and recommended the use of TCC when performing three-group count data. Therefore, we followed the conditions where TCC performed the best in the previous simulation analysis and investigated whether MBCdeg could exceed TCC as well as the other two methods (i.e., edgeR and DESeq2).
As in the evaluation study described [43], we focused on a three-group data with equal numbers of replicates (i.e., three replicates per group; n1 = n2 = n3 = 3). The other simulation conditions were G = 10,000, PDEG = 0.25, and FC = 4. For the proportion of upregulated DEGs in individual groups (P1, P2, P3), we prepared a total of four conditions: (1/3, 1/3, 1/3), (0.6, 0.2, 0.2), (0.5, 0.5, 0.0), and (1.0, 0.0, 0.0). The numbers of true clusters Ktruth for these conditions were 4, 4, 3, and 2, respectively. The first two conditions are the same as those used in the previous study, and the other two conditions were used to examine the effect on different Ktruth values (i.e., 3 and 2). We compared a total of four K values (K = 2, 3, 4, and 5) when performing an analysis using MBCdeg. For the three packages (edgeR, DESeq2, and TCC), gene ranking was performed based on an ANOVA-like p-value, where a low p-value for a gene indicates a high degree of DE in at least one of the groups compared.
Figure 4 shows the boxplots of AUC values using these methods in the four conditions described. We observed that the AUC values were the highest when K = Ktruth and that the performances with K > Ktruth were higher than those with K < Ktruth. Further, this trend was observed when the number of replicates per group was increased to n1 = n2 = n3 = 6, 9, and 12 (Additional file 6). However, for simulated data with variable FC values, AUC values for MBCdeg with K ≥ Ktruth tended to be higher than those with K = Ktruth (Additional file 7). This is consistent with the results of the two-group data (Additional file 4). Together, these results indicate that MBCdeg yields better performance with larger K values (e.g., K > I + 1) when using real data with a group number I.
Two-group real data
We examined the performance of MBCdeg for a real count dataset available from the recount2 database [48]. As with previous analyses, we compared a total of five methods: edgeR, DESeq2, TCC, MBCdeg1, and MBCdeg2. For the first three packages, genes with a false discovery rate (FDR) of 0.1 or higher were defined as non-DEGs. The other genes were identified as having DEG1 or DEG2 patterns in the direction of DE. For MBCdeg, we investigated the effect on different K values (= 3, 4, and 5). As previously described, genes belonging to the cluster with the smallest L2 Norm were considered as non-DEGs. The other genes were identified as having DEG1 or DEG2 patterns in the direction of DE, regardless of the cluster to which the genes belonged. As the true DE results are not known, we focus our attention on the similarities between the methods.
The real dataset (called “Pickrell”) consisting of 51,910 × 69 samples was designed to compare the expression levels of lymphoblastoid cell lines between 40 females and 29 males [49]. This dataset has been widely used for validation purposes [50]. Table 4 shows the results of the gene classification into one of the three patterns. Overall, MBCdeg1 with K = 5 showed (i) the maximum PDEG value of (1,291 + 3,433)/51,910 = 0.0910, and (ii) the maximum degree of asymmetry, defined as max(P1, P2), of 3,433/(1,291 + 3,433) = 0.7267. These values are within the range used in the simulation analysis (see Additional files 2 and 4). The values in parentheses indicate the average log2(FC) values for the genes. Our results showed that the values for DEG1 and DEG2 are approximately 1.8 (i.e., the average FC is about 2^1.8 = 3.5), which is within the values of FC = 2.5 and 4 introduced in the current simulation analysis.
DESeq2 and TCC showed similar results (99.8% concordance), and edgeR and MBCdeg also showed similar results (> 97% concordance). Even the lowest concordance rate was found between DESeq2 and MBCdeg1 with K = 5 at 90.86% (see “Sheet1” in Additional file 8). This high concordance rate (≥ 90.86%) among the methods is reasonable as the maximum PDEG value is 9.10% (i.e., mostly determined as non-DEGs). We observed a high similarity between the results from six MBCdeg methods, where the lowest concordance rate was found between MBCdeg1 with K = 5 and MBCdeg2 with K = 4 at 97.26%. This suggests that a cluster of DEG1 (or DEG2) patterns obtained by K = 3 is divided into sub-clusters when K = 4 or 5. Indeed, MBCdeg1 with K = 4 produced two clusters (containing 1,939 genes and 1,454 genes, respectively) assigned to the DEG2 pattern (see “Sheet2” in Additional file 8). Of a total of 3,393 genes, 2,713 genes were included in the 3,298 DEG2 genes obtained by MBCdeg1 with K = 3 (Table 4).
Similar to the analysis using MBCdeg1 with K = 4, MBCdeg2 with K = 4 produced two clusters (containing 60 genes and 1,291 genes, respectively) assigned to the DEG1 pattern (“Sheet2” in Additional file 8). Interestingly, the 1,291 genes in the second cluster were identical to the 1,291 DEG1 genes obtained using MBCdeg2 with K = 3, and 1,290 of the 1,291 genes were the same as those obtained using MBCdeg1 with K = 3–4. In the results obtained using MBCdeg2 with K = 4, we found that the representative expression patterns for the remaining 60 genes assigned to the DEG1 pattern (the cluster centers µ = (0.184, -0.184)) were very similar to those assigned to the non-DEG pattern (µ = (0.014, -0.014)) than the other 1,291 genes to the DEG1 pattern (µ = (3.117, -3.117)), given the degrees of DE. These 60 genes were assigned to non-DEG pattern at K = 3. Although we defined only one cluster with the smallest L2 Norm as of non-DEG, it may be better to define the clusters with low L2 Norm as of non-DEGs and to consider their assignments when using other K values.
In the results with K = 5, we observed that MBCdeg2 gave two DEG1 clusters containing 38 and 1,291 genes, and two DEG2 clusters containing 1,938 and 1,364 genes. Additionally, MBCdeg1 gave one DEG1 cluster containing 1,291 genes and three DEG2 clusters containing 1,923, 39, and 1,471 genes, respectively. As described earlier, the 1,291 DEG1 genes obtained by MBCdeg2 with K = 5 were identical to those obtained by MBCdeg2 with K = 3–4 and 1,290 out of the 1,291 DEG1 genes were the same as those obtained by MBCdeg1 with K = 3–5. The degrees of DE in the three DEG2 clusters obtained when using MBCdeg1 were sorted in descending order of the clusters consisting of 1,923, 39, and 1,471 genes (“Sheet2” in Additional file 8). The 1,962 genes from the first two DEG2 clusters were included in the DEG2 cluster containing 3,298 genes, that was obtained by using MBCdeg1 with K = 3. However, 751 (or 720) of the remaining 1,471 genes with the lowest degree of DE from the last DEG2 cluster, were included alone in the DEG2 (or non-DEG) cluster obtained by using MBCdeg1 with K = 3. These results suggest that the use of MBCdeg for DE analysis should take into account the fluctuation of expression patterns in clusters obtained at various K values, as well as the PPs assigned to each gene.