Due to the largely methodological nature of this study, we first outline the overall approach (Fig. 1) before providing additional details below. The approach is worked out in our primary study (Study 1), then validated on two independent datasets (Studies 2 and 3). Starting with Study 1, to compare multivariate with univariate approaches to localizing language cortex, we first created a set of multivariate regions of interest. This was done using representational fidelity (RF) analysis (Rothlein, DeGutis, & Esterman, 2018). We wanted to be inclusive at this stage, so we based the RF analysis on data from visual word presentations relative to an implicit (visual fixation) baseline. This resulted in a map of voxels showing consistent responses across words and subjects, which we used as the multivariate regions of interest (mROI). We then compared this with a widely used set of univariate regions of interest (uROI) for language, developed and made available by Fedorenko et al. (2010).
The uROI were based on a contrast of sentences > pseudowords, the results of which are thought to highlight neural areas processing semantics and syntax. To facilitate as direct a comparison as possible with the univariate approach while maintaining the use of simple stimuli on which we can exert tight experimental control, we focused on single-word semantics. Our primary analyses involved RSA (Kriegeskorte, Mur, & Bandettini, 2008), in which the predicted Representational Dissimilarity Matrices (RDM) were defined in terms of differences in semantic measures among stimuli. These predicted RDM were then compared to observed RDM defined in terms of neural responses to each stimulus within a searchlight. To account for properties of words other than semantics, analyses were conducted in terms of partial correlations that also included predicted RDM for orthographic and phonological word properties. Resulting parameter estimates were queried and compared between the mROI, uROI, and their spatial overlap. Lateralization indices were also calculated as a measure of external validity.
The mROI was defined in terms of words compared to baseline, which is independent of the predicted RDM for words defined in terms of semantics. However, the surest test of independence is to apply the mROI to different data. That is what we did in Studies 2 and 3. In Study 2, the predicted RDM was again defined in terms of semantics. In Study 3, we used a dataset that might be expected to favor the uROI approach. The stimuli were multi-word phrases, and activations were defined in terms of univariate contrasts. In all three studies, we compared activation and laterality indices for mROI, uROI, and their spatial overlap (Fig. 1).
Study 1 (Primary Experiment)
Participants
We recruited 20 neurotypical, right-handed speakers of English as a first language who were between the ages of 18 and 24 (13 female, 7 male). Mean age was 20 (SD: 1.54) years. Participants were recruited from the Rutgers University-Newark campus and completed an online screening form to assess eligibility and MRI safety. From the screening responses, eligibility was determined by the absence of 1) any history of neuropsychological disorders (past or present), 2) psychoactive medication use or drug/alcohol abuse, 3) left-handedness, 4) English learned after five years of age, 5) any history of medical conditions that indicate neurological or physiological disturbance (e.g., severe concussion, diabetes, fainting spells), and 6) metal in soft body tissue not anchored in bone. Participants provided written informed consent in accordance with Rutgers University Institutional Review Board protocol.
Stimuli
The 192 total words in the stimulus set consisted of 128 abstract and 64 concrete words. Twice as many abstract than concrete words were included because of a separate planned analysis to compare abstract words based on internal features (e.g., emotion, thought, morality) versus external features (e.g., time, space, number). Because that analysis is not relevant to the current study, we considered all 128 abstract words together. Characteristics were compared between abstract and concrete words using standard two-sample t-tests. The abstract words differed significantly from the concrete words on rated concreteness, based on a large independent set of ratings (Brysbaert, Warriner, & Kuperman, 2014), but were otherwise matched (not significantly different) on word frequency (log10-transformed occurrences per-million values), orthographic length (number of letters), number of syllables, orthographic Levenshtein distance (OLD20, average distance between a word and its 20 nearest orthographic neighbors; Yarkoni, Balota, & Yap, 2008) and bigram frequency (log10-transformed per-million values for words that share the same two-letter pair in the same position as the target word). Word frequency estimates were obtained from the SUBTLEX-US database (Brysbaert & New, 2009), number of syllables and Levenshtein distances were computed using the quanteda and vwr packages in R, respectively (Benoit et al., 2018; Keuleers, 2013), and bigram frequencies were obtained from the McWord online database of calculations based on CELEX (Baayen, Piepenbrock, & Gulikers, 1995; Medler & Binder, 2005). See Table 1 for summary of word characteristics.
Table 1
Characteristics of the word stimuli. Abstract and concrete words did not reliably differ (p ≥ 0.1) on any listed characteristic except the target factor of concreteness (p < 0.001). Values for abstract and concrete words are given as means (standard deviations, SD).
Word characteristic | Abstract | Concrete | t(190) |
Concreteness rating (1–5, low-high concreteness) | 2.45 (0.59) | 4.79 (0.19) | 30.91 |
Word frequency (log10) | 6.20 (1.87) | 6.03 (1.44) | 0.63 |
Bigram frequency (log10) | 6.44 (0.93) | 6.24 (0.89) | 1.44 |
Length (letters) | 7.45 (2.53) | 7.59 (1.60) | 0.41 |
Syllables | 2.59 (1.13) | 2.45 (0.71) | 0.91 |
OLD20 | 2.44 (0.75) | 2.64 (0.85) | 1.69 |
Task
During fMRI, participants performed a familiarity judgment task in which they indicated with a button-press whether the word presented on the screen was one that they use or hear often (familiar) or not (unfamiliar). This task was adapted from Wang et al. (2018) and was chosen 1) to encourage participants to focus on each word, up to and including its meaning, while avoiding undue engagement of additional processes such as working memory or meta-cognitive evaluation, and 2) to elicit measurable responses so as to ensure continual task engagement.
PsychoPy software was used for stimulus delivery and response collection (Peirce, 2007). Participants were given an MRI-compatible two-button box and instructed to press one button if the word was one that they use or hear others use often (familiar), and the other button if the word was one that they do not use or hear others use often (unfamiliar). An initial practice condition was included that provided examples of words that might be used or heard often (e.g., water) and words that might not be used or heard often (e.g., veal) for additional clarity. The experiment followed a randomized, event-related design. Following a similar paradigm to Wang et al. (2018), each trial consisted of the following: First, a fixation cross was presented in the middle of the screen for 500 ms, then the stimulus (word) was displayed for 1500 ms, and then another fixation cross was displayed for 500 ms. Then, the screen returned to a fixation cross for an inter-trial interval (jitter), randomly jittered for ≥ 2000 ms. Reaction time was recorded at the time of the first button press after stimulus onset.
All 192 unique word trials were fully randomized across all conditions and arranged into two “runs” (uninterrupted sets of trials with continuous image acquisition), with 96 words per run. Following these initial two runs, each word appeared twice more in subsequent runs, for a total of three times across the six runs in the experiment. Words spanned a range of frequencies (an indirect measure of familiarity) to keep participants engaged throughout the task (log10-transformed word frequency min = 1.79, max = 11.79).
MRI data acquisition and processing
Structural and functional brain data were acquired using a Siemens Trio 3 Tesla MRI scanner (Erlangen, Germany) with a 12-channel head coil at the Rutgers University Brain Imaging Center. T1-weighted (1 mm isotropic resolution) structural images were obtained using a Magnetization Prepared Rapid Gradient Echo (MPRAGE) sequence (TR = 1900 ms, TE: 2.52 ms, matrix = 256 x 256 voxels, 176 contiguous axial slices, field of view (FOV) = 256 mm). T2*-weighted (3 mm isotropic resolution) Blood Oxygen Level Dependent (BOLD) functional image slices were acquired in an interleaved order using a gradient-echo echoplanar imaging (EPI) sequence (TR = 2000 ms, TE = 25 ms, matrix = 64 x 64, 35 axial slices, FOV = 192 mm). Two hundred whole-brain volumes, each consisting of 35 axial slices, were acquired for each of the six runs.
Analysis of Functional NeuroImages (AFNI) software (Cox, 1996) and the FMRIB Software Library (FSL; Jenkinson et al., 2012) were used to preprocess neuroimaging data. Specifically, pre-processing steps prior to multivariate analysis consisted of the following: Motion correction and slice-timing correction using the AFNI script, align_epi_anat.py (Saad et al., 2009). Each of the six functional runs was aligned within-runs to the mean image, then the runs were aligned to each other with the third run as the target. For slice-timing correction, the first four time points, during which no task occurred, were ignored to avoid potential image saturation effects. The motion-corrected and slice-timing corrected runs were then concatenated together as input to the AFNI program 3dDeconvolve to generate the full design matrix. Also included as inputs to 3dDeconvolve were an inclusive mask for the EPI data, a censor for the first four TRs, and seven nuisance covariates (covariates of no interest): Six motion parameters (one each for rotation and displacement in the pitch, roll, and yaw directions), and the first principal component of signal from the lateral ventricles, as segmented using the FSL automated segmentation tool, FAST (Zhang, Brady, & Smith, 2001). The resulting design matrix and concatenated functional runs were then input to the AFNI program 3dLSS, which uses the least-squares-sum regression approach described by Mumford et al. (2012) to derive beta-weight images for each stimulus trial. These beta-weight images were re-ordered such that the images corresponding to the stimulus responses were placed in the same order for all participants. This allowed for the same representational dissimilarity matrices (RDM) to be used for each participant. The resulting images were then aligned to a common group space (Talairach space; Lancaster et al., 2000) using nonlinear diffeomorphic routines as implemented in the AFNI script, @SSwarper. 
Those images served as inputs for all subsequent multivariate analyses.
Representational fidelity analysis and multivariate localizer
The multivariate region of interest (mROI) was defined using pattern-based fidelity analyses, in which the basic elements of the analyses were the RDM. Fidelity analyses, as a measure of reproducibility, were performed with these RDM using leave-one-out cross-validation (as in Rothlein, DeGutis, & Esterman, 2018). Here the 20 participants read 192 words that were presented 3 times each. An observed RDM was constructed as a word-by-word matrix containing all the words (as shown on the left side of Fig. 1), where the elements being compared for each word comprised a vector of activations in a searchlight. The activations in the vector reflect responses to words compared to a fixation baseline, without regard to particular properties of the words. The searchlight was a sphere with a radius of 3.5 voxels, containing 123 total voxels. Representational fidelity (RF) was computed within each searchlight by taking all the RDM (one RDM per participant, with responses averaged across the 3 occurrences of each word, yielding 20 RDM) and computing the leave-one-RDM-out reliability: for example, correlating RDM1 with the mean of RDM2 through RDM20, with the resulting correlation coefficient serving as the RF for RDM1. This result was calculated for each voxel in the searchlight. The searchlight sphere was moved over the whole cortex, such that each gray matter voxel served as its center exactly once. This analysis results in a whole-cortex map highlighting the areas showing consistent multivariate patterns across the multiple word presentations and across subjects. The resulting mROI can subsequently be used to focus analyses based on predicted RDM defined in terms of, for example, word-word differences in concreteness, imageability, or other relevant measures (right side of Fig. 1). To ensure an inclusive mROI, the RF results were thresholded at a voxel-level p < 0.05. An extent threshold of 120 voxels was also applied to remove very small, isolated peaks that were unlikely to be reliable.
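The leave-one-out fidelity computation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in the study: it assumes each participant's searchlight RDM has already been vectorized (e.g., as the upper triangle of the word-by-word matrix), one row per participant.

```python
import numpy as np

def representational_fidelity(rdms):
    """Leave-one-RDM-out representational fidelity (after Rothlein et al., 2018).

    rdms : array of shape (n_subjects, n_pairs), each row the vectorized
           RDM for one participant within a given searchlight.
    Returns one fidelity score per participant: the correlation of that
    participant's RDM with the mean RDM of the remaining participants.
    """
    n = rdms.shape[0]
    scores = np.empty(n)
    for i in range(n):
        held_out = rdms[i]
        # Mean RDM across all participants except the held-out one
        mean_rest = rdms[np.arange(n) != i].mean(axis=0)
        scores[i] = np.corrcoef(held_out, mean_rest)[0, 1]
    return scores
```

In the whole-cortex analysis, this computation would be repeated for every searchlight position, producing the RF map that was then thresholded to form the mROI.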
Univariate localizer
For comparison with the multivariate localizer defined based on data from Study 1, we used a univariate localizer (uROI) based on separate data. This was adopted from Fedorenko et al. (2010), with the only change being that a nonlinear warp, calculated using the AFNI script @SSwarper as described above, was applied to move the uROI into Talairach space (Lancaster et al., 2000). The Fedorenko et al. localizer is based on the contrast of sentences > pseudowords (made available at https://evlab.mit.edu/funcloc/). Areas highlighted by this uROI are qualitatively distinct from the mROI in that the uROI reliably engages the superior and middle temporal gyri (outlined in white in Figs. 2 and 3).
Predicted representational dissimilarity matrices
The primary relationship of interest among the word stimuli was in terms of their semantics. The predicted semantic RDM was defined in terms of differences in concreteness for each word pair, where each word has a rated concreteness value (Brysbaert, Warriner, & Kuperman, 2014). The stimulus-stimulus distance matrix was defined as the absolute value of the difference in concreteness between each pair of words in the stimulus set. Stimulus-stimulus distance matrices defined in terms of phonological and orthographic edit distance measures were used to partial out effects of phonology and orthography. As expected, semantic distances were not significantly correlated with orthographic or phonological distances (|r| < 0.02, p > 0.05).
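The construction of the predicted semantic RDM can be illustrated directly from the ratings. The sketch below uses hypothetical concreteness values for four example words; the actual stimuli used norms from Brysbaert et al. (2014).

```python
import numpy as np

# Hypothetical concreteness ratings (1-5 scale) for four example words;
# the study's 192 stimuli used the Brysbaert et al. (2014) norms.
concreteness = np.array([2.1, 2.8, 4.6, 4.9])

# Predicted semantic RDM: absolute difference in rated concreteness
# for each pair of words.
semantic_rdm = np.abs(concreteness[:, None] - concreteness[None, :])
```

The resulting matrix is symmetric with a zero diagonal, matching the structure of the observed (neural) RDMs it is compared against.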
Word dissimilarities for phonology and orthography were defined in terms of their pair-wise distance as the number of edits needed to make the pair identical. To give an orthographic example, bullet and wallet have an edit distance of 2 because only “bu” and “wa” differ between them. However, wallet and jacket have an edit distance of 3 because three letters differ between that pair (w/j, l/c, and l/k). Phonological edit distance is defined similarly, except that phonemes are used instead of letters, and phonetic features of place and manner of articulation are also taken into account when determining whether two phonemes of a word are identical (Hall, Mackie, & Lo, 2019). Including phonetic features when calculating phonological edit distance attenuates the correlation between representations defined in terms of orthography and phonology, such that orthographic and phonological distances for the current set of word stimuli are only correlated at r = 0.34. This modest level of correlation allows them to be included in the same partial correlation analysis, as we have done previously for other word stimuli (Graves et al., 2023).
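The orthographic edit distance described above is the standard Levenshtein distance, which can be sketched as follows (a generic illustration; the phonological version operates on phonemes and weights by phonetic features, and is not shown):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Dynamic programming over a rolling pair of rows
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]
```

Applied to the examples in the text, edit_distance("bullet", "wallet") gives 2 and edit_distance("wallet", "jacket") gives 3.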
Representational similarity analyses (RSA)
RSA compares the predicted RDM to the observed (neural) RDM. For this Study 1 RSA, the same neural data were used as for the RF analyses discussed above. This is justified because the RSA and RF analyses are orthogonal to each other. Whereas the RF analyses are based on the correlations among observed RDMs across participants, the RSA analyses are based on comparing predicted to observed RDMs within participants. Still, there may be concerns about independence. We therefore also include an analysis with independent datasets (see Study 2 analysis below).
To test for differences in sensitivity between the mROI and uROI in the case of multivariate analysis, we compared mean parameter estimates (beta weights for partial correlations in RSA) extracted from within the mROI, uROI, and their spatial overlap. The partial correlation RSA was conducted as a whole-cortex searchlight to test for brain areas related to semantic representations, as distinct from orthographic and phonological representations. This was done using CoSMoMVPA software (Oosterhof, Connolly, & Haxby, 2016). The observed RDM were based on vectors of neural signal intensity (beta weights). Beta values were z-score normalized across stimuli within each voxel. The observed (neural) and predicted RDMs were then compared using Spearman’s rho, and the resultant value was assigned to the center voxel of the searchlight. The searchlight was moved over the whole cortex, such that each gray matter voxel served as its center exactly once. The resulting correlation coefficient maps for each subject were Fisher z-transformed, smoothed using a 5 mm full-width half-maximum kernel, and entered into a one-sample t test, with the results thresholded at a voxel-level p < 0.005 and map-wise cluster corrected to p < 0.05. In this and subsequent studies, when querying all the ROI we aimed for stability of signal and comparability across ROI by taking only the top 20% most active voxels, as established previously (Mitsis et al., 2008). That is, comparisons among the mROI, uROI, and their overlap were carried out as comparisons among the top 20% most active voxels in each case.
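The partial-correlation step at the heart of the searchlight can be sketched as follows. This is a minimal numpy-only illustration of Spearman partial correlation, not the CoSMoMVPA implementation: rank-transform the vectorized RDMs, residualize the predicted (semantic) and observed (neural) RDMs on the nuisance (orthographic and phonological) RDMs, then correlate the residuals.

```python
import numpy as np

def _rank(v):
    """Average ranks (handles ties), as used in Spearman's rho."""
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    sv = v[order]
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and sv[j + 1] == sv[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # average rank over ties
        i = j + 1
    return ranks

def partial_spearman(predicted, observed, nuisance):
    """Spearman partial correlation of two vectorized RDMs,
    controlling for nuisance RDMs (one per column of `nuisance`)."""
    rx, ry = _rank(predicted), _rank(observed)
    design = np.column_stack([np.ones(len(rx))] +
                             [_rank(c) for c in nuisance.T])
    # Residualize both ranked RDMs on the ranked nuisance RDMs
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]
```

In the searchlight analysis, the returned value would be assigned to the center voxel before moving the sphere to the next gray matter voxel.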
To test for differences in validity between the mROI and uROI, we followed the logic outlined by Wilson et al. (2017). The left hemisphere is known to house the majority of critical cortex for language in neurotypical participants, so a more positive laterality index (LI) indicating left-lateralization is indicative of greater face validity of the results. We used the standard formula (Binder et al., 1996; Desmond et al., 1995): LI = (VLeft – VRight)/(VLeft + VRight), where in this case V is the number of significant voxels within the localizer in the given hemisphere.
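The laterality index formula above can be expressed directly, with significant-voxel counts per hemisphere as input:

```python
def laterality_index(v_left, v_right):
    """LI = (V_L - V_R) / (V_L + V_R): +1 is fully left-lateralized,
    -1 fully right-lateralized, 0 perfectly bilateral."""
    return (v_left - v_right) / (v_left + v_right)
```

For example, 300 significant voxels on the left and 100 on the right yields LI = 0.5, a clearly left-lateralized result.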
Study 2
To guard against the possibility that defining the mROI using the same data as the subsequent RSA analysis (albeit for the independent conditions of words compared to fixation for the mROI, and correlations with predicted RDM for the RSA analysis) might lead to over-fitting or a degree of logical circularity (Kriegeskorte et al., 2009), we performed analyses similar to those of Study 1 in an independent dataset. In Study 2, participants made lexical decisions to visually presented words. The nonword foils were pseudowords. These foils were chosen so that lexical decisions would be based primarily on whether the letter string was meaningful (a semantic criterion), as opposed to simply pronounceable (a phonological criterion) or visually familiar (an orthographic criterion). This dataset was published previously and is more extensively documented in Graves et al. (2017). A brief description of the most relevant elements follows.
Participants, task, and stimuli
A total of 20 participants (13 women, 7 men), all right-handed with English as a first language and reporting no neurological or psychiatric diagnoses or history of learning disability, gave written informed consent to participate in the study. Their mean age was 25.3 years, with 16.6 mean years of education. During fMRI scanning, participants performed a visual lexical decision task, in which participants indicated with a button press whether or not they judged the string of letters being displayed to form a valid English word. A total of 312 words and 312 pseudowords were randomly intermixed and presented across 6 runs in the experiment. The words were selected to be of either high or low frequency and high or low imageability, in a completely crossed 2 x 2 factorial design. Pseudowords were generated to contain valid English trigram (3-letter) sequences to ensure pronounceability. They did not significantly differ from words in terms of number of letters, bigram frequency, or trigram frequency.
MRI data acquisition and processing
MRI data were acquired using a 3T GE Excite system with an 8-channel array head coil. Acquisition parameters were as follows: To ensure high quality anatomical images, we acquired two T1-weighted high-resolution anatomical images, one in axial orientation with a resolution of 0.938 x 0.938 x 1.000 mm, and one in sagittal orientation (1.000 x 0.938 x 0.938 mm), each consisting of 180 contiguous slices. Functional EPI scans were acquired with 25 ms TE, 2000 ms TR, 208 mm FOV, 64 x 64 pixel matrix, in-plane voxel dimensions of 3.25 x 3.25 mm, and slice thickness of 3.3 mm with no gap. The 41 axial slices were acquired in interleaved order, and each of the 6 functional runs consisted of 140 whole-brain volumes.
The MRI data were pre-processed as described in Graves et al. (2017), including field unwarping, slice-timing correction, and motion correction. Beta-weight images were then derived for each stimulus trial using least-squares-sum regression (Mumford et al., 2012), implemented in the AFNI program, 3dLSS as described above for Study 1.
Representational similarity analyses
We performed RSA on this dataset, where the predicted RDM of interest was defined in terms of imageability, a measure of the subjective degree to which a word calls to mind a sensory impression. This measure of single-word semantics has been shown to be highly correlated with concreteness (Altarriba, Bauer, & Benvenuto, 1999). The predicted orthographic and phonological RDM were defined and calculated as described for Study 1, but for the distinct stimuli in Study 2. For the word stimuli in this dataset, the orthographic edit distance and the phonological edit distance are correlated for the set of words at r = 0.46 (p < 0.001). However, levels of multi-collinearity below r = 0.7 are generally considered to not violate the assumptions of the general linear model, of which partial correlation analyses are a special case (Kutner et al., 2005).
Study 3
An additional study was included to test the possibility that the univariate localizer would be better suited for detecting activation from univariate contrasts. Additionally, the uROI localizer based on multi-word combinations may be a better fit to data from participants tested using multi-word (in this case, article-noun-noun) combinations, whereas the mROI localizer based on single-word data may be a better fit for testing experiments using single-word stimuli. Note that the mROI and uROI used to query results (averaging activations across each voxel in the ROI) are identical to the ones used in Study 1 above and – as in Study 2 – are defined independently of the current dataset. This dataset was published previously and is more extensively documented in Graves et al. (2010). A brief description of the most relevant elements follows.
Participants, task, and stimuli
A total of 22 participants (15 female, 7 male), all right-handed with English as a first language and reporting no neurological or psychiatric diagnoses, gave written informed consent to participate in the study. Their mean age was 24.7 years (SD: 5.4). During fMRI scanning, participants were asked to press one button if the phrase displayed was meaningful, another if not meaningful, and a third if it was made of pseudowords. The noun-noun phrases were presented in either sensible order, e.g., the ski jacket, or reversed order, e.g., the jacket ski. They were taken from a larger human-rated set (Graves, Binder, & Seidenberg, 2013), and selected for being maximally sensible in forward but minimally sensible in reversed order. Pseudoword phrases, e.g., the rola brip, were presented as a comparison condition. The pseudowords were matched to words on the surface characteristics of length (in total number of letters) and bigram frequency (a measure of orthotactic typicality), as retrieved from MCWord (Medler & Binder, 2005). Participants were shown a total of 200 forward (meaningful) phrases, 200 reversed (non-meaningful) phrases, and 200 pseudoword phrases.
MRI data acquisition and processing
The MRI data were acquired using a 3T GE Excite scanner with an 8-channel array head coil and the following parameters: T1-weighted high-resolution anatomical images had a resolution of 0.938 x 0.938 x 1.000 mm across 134 contiguous axial slices. Functional EPI scans were acquired with 25 ms TE, 2000 ms TR, 224 mm FOV, 64 x 64 pixel matrix, in-plane voxel dimensions of 3.5 x 3.5 mm, and slice thickness of 3.0 mm with a 0.5 mm gap. The 33 axial slices were acquired in interleaved order, and each of the 4 functional runs consisted of 232 whole-brain volumes.
Subsequent processing steps were as described in Graves et al. (2010), including smoothing at 5 mm FWHM and thresholding at a cluster-corrected p < 0.05, applied to the contrast of meaningful phrases minus pseudoword phrases. Volumetric results were then mapped onto the nearest gray matter surface for display (Fig. 4) using the AFNI program 3dVol2Surf, and rendered using SUMA software (Saad & Reynolds, 2012).