We describe a new centrality-based algorithm and a projected correlation-distance algorithm for ranking the importance of ROIs for predicting a class variable. Both methods are based on nearest-neighbor projected-distance regression (NPDR), a machine learning algorithm that can detect statistical interactions using nearest neighbors in a high-dimensional space [15]. NPDR minimizes a contrastive loss function for pairs of samples \((i,j)\). The contrastive loss \({\delta }_{ij}\left(y\right)\) is an indicator of whether samples \((i,j)\) are in the same class or different classes based on the class variable y. The contrastive loss can be penalized with LASSO or Ridge, or it can be left unpenalized so that P-values can be computed. Rather than using the predictor/attribute values directly in the regression, NPDR uses the difference \({\vec{d}}_{ij}\left(X\right)\) (or projected distance onto the attributes X) between subjects \((i,j)\). This vector contains the projected distances for all attributes in the set X. In the current application, the attributes are Pearson correlations between pairs of ROIs. For centrality NPDR (c-NPDR), the projected distance or diff, \({d}_{ij}\left(p\right)\), is the absolute difference between subjects \((i,j)\) for one correlation attribute p (the correlation between a pair of ROIs):
$${d}_{ij}\left(p\right)=\left|{A}_{p}^{\left(i\right)}-{A}_{p}^{\left(j\right)}\right|, \tag{1}$$
where \({A}_{p}^{\left(i\right)}\) is the correlation for subject i between a pair of ROIs, represented by p. Thus, if there are n ROIs, the NPDR design matrix consists of n(n-1)/2 attribute columns, one for each pair of ROIs.
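To make the construction concrete, the following Python sketch builds the n(n-1)/2 correlation-attribute columns from per-subject ROI correlation matrices and computes the Eq. (1) diff for one subject pair. The array shapes and function names are illustrative only and are not the NPDR package interface.

```python
import numpy as np

def upper_triangle_attributes(corr):
    """Vectorize each subject's ROI correlation matrix into n(n-1)/2 ROI-pair attributes.

    corr: array of shape (n_subjects, n_rois, n_rois) of per-subject Pearson correlations.
    """
    n_rois = corr.shape[1]
    rows, cols = np.triu_indices(n_rois, k=1)           # one attribute p per ROI pair (r, k)
    return corr[:, rows, cols], list(zip(rows, cols))   # attributes: (n_subjects, n_pairs)

def diff_eq1(A, i, j):
    """Eq. (1): d_ij(p) = |A_p^(i) - A_p^(j)| for every ROI-pair attribute p."""
    return np.abs(A[i] - A[j])

# Toy example: 20 subjects, 10 ROIs -> 45 ROI-pair attribute columns.
rng = np.random.default_rng(0)
corr = np.clip(rng.normal(0.0, 0.3, size=(20, 10, 10)), -1, 1)
A, pairs = upper_triangle_attributes(corr)
d_01 = diff_eq1(A, 0, 1)   # one row of the c-NPDR design matrix for the subject pair (0, 1)
```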
The NPDR-selected ROI pairs can then be used with any number of centrality algorithms to rank the importance of individual ROIs. For comparison, we use the following centralities: degree, betweenness, eigenvector, and integrated value of influence (IVI) [30].
The other NPDR-based method (correlation-diff-NPDR) for ranking the importance of ROIs from ROI-pair correlation data uses a more complex projected distance, \({d}_{ij}^{CD}\), but directly gives the importance of individual ROIs without requiring centrality calculations [27]. The correlation-diff (CD) or correlation projected distance for ROI r is given by
$${d}_{ij}^{CD}\left(r\right)={\sum }_{k\ne r}\left|{A}_{rk}^{\left(i\right)}-{A}_{rk}^{\left(j\right)}\right|, \tag{2}$$
where \({A}_{rk}^{\left(i\right)}\) is the correlation between ROIs r and k for subject i. Thus, the correlation-diff for ROI r is the sum, over all other ROIs k, of the absolute differences in the two subjects' correlations involving r. If there are n ROIs, the NPDR design matrix for Eq. (2) has n columns, as opposed to n(n-1)/2 for Eq. (1). Thus, NPDR with Eq. (2) yields importance scores for individual ROIs, while NPDR with Eq. (1) yields the importance of ROI pairs. In both cases (Eqs. 1 and 2), NPDR importance can be computed in terms of individual P-values or in a multivariate model with LASSO or Ridge.
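A corresponding sketch of the Eq. (2) correlation-diff, using the same per-subject correlation array as in the previous sketch (again an illustration of the formula only, not the package code):

```python
import numpy as np

def diff_eq2(corr, i, j):
    """Eq. (2): d_ij^CD(r) = sum_{k != r} |A_rk^(i) - A_rk^(j)|, one value per ROI r."""
    abs_diff = np.abs(corr[i] - corr[j])   # (n_rois, n_rois) matrix of |A_rk^(i) - A_rk^(j)|
    np.fill_diagonal(abs_diff, 0.0)        # exclude k = r
    return abs_diff.sum(axis=1)            # length-n_rois vector: one design-matrix column per ROI
```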
To threshold the results of correlation-diff-NPDR and c-NPDR, we use regularization and P-values. We use the LASSO (L1) penalty, a regularization technique used in regression models to prevent overfitting and to enhance the model's prediction accuracy and interpretability. For non-penalized methods, we use an adjusted P-value cutoff, where ROI pairs with an adjusted P-value > 0.05 are removed from the network. To threshold the random forest results, we use a cutoff of the top 100 pairs of ROIs (Fig. 2).
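The thresholding step can be summarized with the short sketch below; the multiple-testing adjustment method (Benjamini-Hochberg here) and the variable names are assumptions for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def threshold_by_adjusted_pvalue(pvals, alpha=0.05, method="fdr_bh"):
    """Keep attributes whose multiple-testing-adjusted P-value is <= alpha."""
    keep, p_adj, _, _ = multipletests(pvals, alpha=alpha, method=method)
    return np.flatnonzero(keep), p_adj

def threshold_top_k(importances, k=100):
    """Keep the k ROI pairs with the largest (e.g., random forest) importance."""
    return np.argsort(importances)[::-1][:k]
```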
Correlation-diff-NPDR directly yields a list of significant individual ROIs. However, the centrality methods need an additional step to map pair importance to individual importance. The c-NPDR and c-rf methods yield lists of important pairs of ROIs, so we apply centralities to the resulting edge lists to obtain a list of important individual ROIs (Fig. 2). The significant pairs of ROIs are graphed as a network, where the nodes are ROIs and edges are defined when the ROI pairs have a correlation that affects the outcome variable (e.g., MDD). This interaction network provides a way to visualize the importance of MDD-associated nodes based on their connections and the local network structure. We quantify the importance of individual ROIs using common centralities: degree, eigenvector, betweenness, and IVI, which combines multiple centrality measures [30].
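As an illustration of this mapping step, the sketch below builds the interaction network from a thresholded edge list and computes degree, betweenness, and eigenvector centralities with networkx; IVI follows the method of [30] and is not reimplemented here, and the ROI names are placeholders.

```python
import networkx as nx

def roi_centralities(significant_pairs):
    """significant_pairs: iterable of (roi_a, roi_b) ROI pairs that survive thresholding."""
    G = nx.Graph()
    G.add_edges_from(significant_pairs)
    return {
        "degree": nx.degree_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    }

# Rank individual ROIs by, e.g., degree centrality of the interaction network.
cent = roi_centralities([("ROI1", "ROI2"), ("ROI1", "ROI3"), ("ROI2", "ROI4")])
ranked = sorted(cent["degree"], key=cent["degree"].get, reverse=True)
```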
We compare the NPDR methods to a centrality version of random forest (c-rf). We use the correlation predictor data (Fig. 1d) with random forest permutation importance with 5000 trees, filter the correlation pairs to the top 200 to create a network, and then compute ROI centralities.
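A hedged scikit-learn sketch of the c-rf step, reusing the attribute matrix A and ROI-pair list from the earlier sketch; hyperparameters other than the number of trees and the top-pair cutoff are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def c_rf_edges(A, y, pairs, n_trees=5000, top=200, seed=0):
    """Fit a random forest on ROI-pair correlation attributes and return the top ROI-pair edges."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(A, y)
    imp = permutation_importance(rf, A, y, n_repeats=10, random_state=seed)
    keep = np.argsort(imp.importances_mean)[::-1][:top]
    return [pairs[p] for p in keep]   # edge list passed to the centrality step above
```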
Simulation Method and Real Data.
Simulation Approach. We develop a random network approach to simulate correlation-based features, a fraction of which are functional or associated with the case-control status (Fig. 3). The application we have in mind is correlation between brain ROIs in resting-state fMRI studies, where correlation is calculated from the BOLD signal time-series. We do not simulate the time series, but rather directly simulate the correlations and their differences between groups. Features or predictors are correlations between pairs of ROIs rather than ROIs themselves. We note that these simulations and feature selection methods would also apply to other types of correlation-based data in other research domains.
The user specifies the number of ROIs, the number of cases and controls, the number of functional ROIs (i.e., those associated with the outcome), the effect size, and the type of underlying random network for the brain. The user can specify their own network, for example from real data, or they can generate a network using the igraph library. Initial correlation matrices are generated for each sample based on the network, where connected ROIs have higher random correlations than unconnected ROIs.
Functional nodes are chosen from the largest connected component (i.e., a group of nodes such that there is a path between any pair of nodes in the group). Edges between the functional nodes (green edges in Fig. 3) are then used to create differential correlation between cases and controls (black dots in Fig. 3 heatmaps). We use a parameter called "multiway" that controls how many edges we randomly select to generate differential correlation. For example, a multiway of 2 will use only a subset of the possible edges between functional nodes (only that subset will be green). If we set multiway to the maximum, then all possible edges between functional nodes will have differential correlation (green). We use multiway = 5 in this application.
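The sketch below illustrates the simulation idea in simplified form; networkx is used here in place of igraph, and the correlation distributions and default parameter values are assumptions rather than the exact generative model.

```python
import numpy as np
import networkx as nx

def simulate_correlations(n_rois=20, n_cases=30, n_controls=30, n_functional=4,
                          effect=0.4, multiway=5, seed=0):
    rng = np.random.default_rng(seed)
    G = nx.erdos_renyi_graph(n_rois, p=0.15, seed=seed)       # underlying random brain network
    giant = max(nx.connected_components(G), key=len)          # largest connected component
    functional = set(rng.choice(sorted(giant), n_functional, replace=False))
    # Edges between functional nodes; "multiway" limits how many carry differential correlation.
    func_edges = [e for e in G.edges() if e[0] in functional and e[1] in functional]
    diff_edges = func_edges[:multiway]

    def one_subject(is_case):
        base = rng.normal(0.0, 0.1, size=(n_rois, n_rois))
        A = (base + base.T) / 2                               # weak background correlation
        np.fill_diagonal(A, 1.0)
        for r, k in G.edges():                                # connected ROIs: higher correlation
            A[r, k] = A[k, r] = rng.normal(0.5, 0.1)
        if is_case:
            for r, k in diff_edges:                           # cases shifted by the effect size
                A[r, k] = A[k, r] = rng.normal(0.5 + effect, 0.1)
        return np.clip(A, -1, 1)

    corr = np.stack([one_subject(i < n_cases) for i in range(n_cases + n_controls)])
    y = np.r_[np.ones(n_cases, int), np.zeros(n_controls, int)]
    return corr, y, sorted(functional)
```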
We generate replicate simulations to compare feature selection methods based on their ability to detect the ground-truth functional ROIs. We use the F1 score to measure how well the top ROI features selected by a method overlap with the ground-truth functional features.
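The evaluation reduces to a set-overlap F1, sketched below; the number of top-selected ROIs is assumed here to equal the number of functional ROIs.

```python
def f1_rois(selected, functional):
    """F1 score of a method's top-selected ROIs against the simulated functional ROIs."""
    selected, functional = set(selected), set(functional)
    tp = len(selected & functional)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(functional)
    return 2 * precision * recall / (precision + recall)
```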
Real Data. We compare feature selection methods on data from the Tulsa 1000 (T1000), a longitudinal study at the Laureate Institute for Brain Research following 1000 individuals, including healthy individuals and those with mood and other disorders [28]. We use rs-fMRI time series for 188 MDD subjects and 47 healthy controls (HC) from T1000 (163 female and 72 male). We use the Automated Anatomical Labelling Atlas (AAL Atlas) with 87 ROIs and the Brainnetome Atlas with 246 ROIs to define consistent and interpretable mappings for selected features [29, 10]. The Brainnetome Atlas parcellates the brain based on structural and connectivity features. It is based on neuroimaging data, particularly rs-fMRI and diffusion tensor imaging (DTI) data, which reveal both the functional and structural connectivity patterns in the brain. For each atlas, we detrended the signals and averaged the time-series for the voxels within an atlas ROI.
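One possible implementation of this preprocessing step, sketched with nilearn; the toolchain and file names are placeholders and not necessarily the pipeline used for T1000.

```python
import numpy as np
from nilearn.maskers import NiftiLabelsMasker

# Average voxel time series within each atlas ROI, with detrending, then compute
# the ROI-by-ROI Pearson correlation matrix for one subject.
masker = NiftiLabelsMasker(labels_img="atlas_labels.nii.gz", detrend=True)
roi_ts = masker.fit_transform("subject_rsfmri.nii.gz")   # shape: (n_timepoints, n_rois)
roi_corr = np.corrcoef(roi_ts.T)                          # shape: (n_rois, n_rois)
```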