Study selection
The PRISMA flow diagram for this review is presented in Fig. 1. The search across the five databases identified 3330 records. After duplicates were removed, 2737 titles and abstracts were independently screened, and 2695 records were excluded for not meeting the eligibility criteria. The remaining 42 studies were retrieved, most directly from open sources and some from the primary authors (via e-mail). Full-text assessment of these 42 studies led to the exclusion of a further 17 on the basis of the exclusion criteria (see Fig. 1), leaving 25 studies from the database search for inclusion in the current review.
Additionally, the manual search of other sources identified 20 records, all of which were retrieved directly from open sources and primary authors. After screening the abstracts and results, 16 studies were excluded according to the exclusion criteria, resulting in 4 studies for this review. Therefore, a total of 29 studies were included in this review.
Table 1 Study characteristics
Study | Country | EFAT | N (F) | MAGE (SD) | MEDU (SD)
Kruger et al. (42) | Brazil | Executive Function Scale for Adults (EFSA) | 484 (376) | 37.4 (16.4) | -
Scarfo et al. (13) | Australia | Anderson's pediatric model | 133 (91) | 29.7 (7.5) | -
Wang et al. (30) | China | “Fisherman” serious game | 108 (73) | 66.8 (4.1) | 11.9 (2.8)
Arioli et al. (40) | United States | Remote Characterization Module (RCM) | 40 (23) | 74.5 (6.5) | 17.3 (1.3)
Karlsen et al. (43) | Norway | Cambridge Neuropsychological Test Automated Battery (CANTAB) | 75 (30) | 32.2 (13.1) | 14.0 (2.4)
Ott et al. (44) | United States | National Institute of Health Cognitive Battery (NIH-CB) | 83 (19) | 44.3 (13.4) | 16.4 (2.1)
Park & Schott (31) | Germany | Digital version of the Trail Making Test (dTMT) | 53 (31) | 42.2 (22.8) | 17.7 (4.7)
Wahyuningrum et al. (18) | Indonesia | Indonesian Neuropsychological Test Battery (INTB) | 490 (294) | - | 14.0 (2.8)
Heled et al. (45) | Israel | “Tactual Span” | 140 (70) | 24.8 (3.0) | 13.9 (1.3)
Pires et al. (46) | Portugal | Various EF tests (Set 1) | 90 (76) | 19.8 (2.8) | 13.1 (1.5)
White et al. (47) | New Zealand | Various EF tests (Set 2) | 30 (0) | 68.1 (1.7) | -
Cotrena et al. (48) | Brazil | Melbourne Decision Making Questionnaire (MDMQ) | 234 (101) | 28.7 (11.2) | 14.9 (4.4)
Feenstra et al. (49) | Netherlands | Amsterdam Cognition Scale | 248 (157) | 49.1 (12.9) | -
Parsons & Barnett (39) | United States | Virtual apartment-based Stroop test | 89 (72) | 44.2 (28.1) | -
Rijnen et al. (50) | Netherlands | Central Nervous System Vital Signs | 158 (90) | 45.9 (14.4) | 16.9 (3.3)
Soveri et al. (51) | Finland | Various EF tests (Set 3) | 37 (-) | 23.0 (2.4) | -
Ishigami et al. (37) | Canada | Attention Network Test-Interaction (ANT-I) | 173 (91) | 65.4 (6.5) | 15.5 (2.7)
Kaller et al. (38) | Germany | Tower of London Freiburg version (TOL-F) | 4600 (2292) | 57.0 (13.6) | -
Cardoso et al. (52) | Brazil | Hotel Task (HT) | - | - | -
Godoy et al. (53) | Brazil | Barkley Deficits in Executive Functioning Scale (BDEFS) | 60 (39) | 27.3 (12.3) | -
Köstering et al. (54) | Germany | Tower of London Freiburg version (TOL-F) | 27 (15) | 22.6 (1.8) | -
Malloy-Diniz et al. (55) | Brazil | Brazilian version of the Barratt Impulsiveness Scale (BIS-11BV) | 3053 (1852) | 31.7 (11.8) | -
Heaton et al. (56) | United States | National Institute of Health Cognitive Battery (NIH-CB) | 268 (149) | 52.3 (21.0) | 13.4 (2.9)
Troyer et al. (32) | Canada | Online EF tests tool (Set 4) | 396 (202) | 65.0 (8.2) | -
Aalbers et al. (57) | Netherlands | Brain Aging Monitor-Cognitive Assessment Battery (BAN-COG) | 397 (312) | 54.9 (9.6) | -
Brunner et al. (58) | Norway | P3 No-Go wave | 26 (10) | 27.5 (med) | -
Kang et al. (41) | United States | Kaplan version of the Stroop test | 153 (63) | 70.2 (8.0) | 16.8 (2.6)
Beato et al. (59) | Brazil | Brazilian version of the Frontal Assessment Battery (FABBV) | 275 (163) | 66.4 (10.6) | 8.9 (5.1)
Dubois et al. (36) | France | Frontal Assessment Battery (FAB) | 163 (-) | 62.8 (11.1) | -
Note. F = female; MAGE = mean age (years); MEDU = mean education (years); SD = standard deviation; med = median; Set 1: Working Memory, Tower, Divided Attention, Stroop, Verbal Fluency, Word List, Confrontation Naming, Coding and Telephone tests; Set 2: Pro, Anti and Pro/Anti, Simon, Flanker, Forward and Backward Spatial, and 2-back tests; Set 3: Simon, Visuoverbal and Visuospatial N-back, Letter-memory and the Number–letter tasks; Set 4: Spatial Working Memory, Stroop Interference, Face-Name Association, and Number-Letter Alternation tasks.
Finally, in both the database and manual search processes, after a thorough review of the retrieved articles (42 from the databases and 20 from the manual search), every exclusion and its justification were discussed among the authors before a final decision was made.
Study characteristics
The characteristics of each study are presented in Table 1. They comprise the study reference, the country of origin, the evaluated EFAT (single test or test battery), and the sample characteristics.
The studies were conducted on every continent except Africa, with the majority in Europe (10 studies), North America (7 studies), and South America (6 studies, all in Brazil). Asia (4 studies) and Oceania (2 studies) are also represented in this review.
Most studies detailed sample characteristics, including the number of participants, mean age, and mean education level, along with their respective standard deviations (see Table 1). For studies that involved more than one sample (36–39), the weighted means and standard deviations were calculated and are presented in Table 1.
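For the studies with more than one sample, the pooled values reported in Table 1 can be obtained with the standard weighted-mean and pooled-SD formulas. The sketch below is our own illustration of that calculation (the `pooled_mean_sd` helper and the subsample numbers are hypothetical, not taken from the reviewed studies):

```python
import math

def pooled_mean_sd(groups):
    """Pool (n, mean, sd) triples from several subsamples into one
    overall mean and standard deviation."""
    n_total = sum(n for n, _, _ in groups)
    # Weighted mean of the subsample means.
    mean = sum(n * m for n, m, _ in groups) / n_total
    # Combine within-group variance with between-group dispersion.
    ss = sum((n - 1) * sd**2 + n * (m - mean)**2 for n, m, sd in groups)
    return mean, math.sqrt(ss / (n_total - 1))

# Hypothetical subsamples: (n, mean age, SD)
mean, sd = pooled_mean_sd([(40, 25.0, 4.0), (60, 30.0, 5.0)])
```

The same helper applies to education years; only the (n, mean, SD) triples change.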
Furthermore, the samples in all studies comprised healthy, economically active participants, with mean ages ranging from 19.8 to 74.5 years. Although six studies specifically targeted older individuals (mean age ≥ 65 years), even in these studies a portion of the participants fell within the economically active age bracket. Hence, all the studies met the criteria for the target population of the current review.
Risk of bias in the reviewed studies
Biases in each reviewed study were assessed via the adapted version of the QUADAS-II tool (35), which is provided in Appendix B. The detailed results for each study can be found in Table 2, whereas the results per domain are summarized in Figure 2.
In the domain of participant selection, few studies employed random selection methods. In addition, six studies inadequately documented how participants were selected. Convenience sampling was the prevalent method in these studies, posing a significant risk of bias.
Apart from three out of 29 studies, the risk of bias for the index test domain (referred to as the EFAT in this review) appeared to be acceptable. However, in the study by Arioli et al. (40), the risk of bias was deemed high for two reasons. First, the results of the novel EFAT (Remote Characterization Module) were evaluated with prior knowledge of the results from conventional tests (reference standard) administered to the same participants before recruitment. Second, thresholds for reliability and validity were not predefined, which is crucial when assessing novel EFATs. Similarly, in the study by Kang et al. (41), no thresholds or reference values were provided for assessing the split-half reliability of the Kaplan version of the Stroop test. Nevertheless, the authors classified the split-half reliability results as good or high (without citing supporting literature).
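As background for the split-half reliability discussed above: the statistic is typically obtained by correlating scores on two halves of a test (e.g., odd vs. even items) and then stepping the correlation up with the Spearman-Brown prophecy formula, since each half is only half the test's length. A minimal sketch of the textbook procedure (the function names and data are ours, purely illustrative):

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def split_half_reliability(items):
    """items: one inner list of item scores per participant.
    Split items into odd/even halves, correlate the half scores,
    then apply the Spearman-Brown correction 2r / (1 + r)."""
    odd = [sum(p[::2]) for p in items]
    even = [sum(p[1::2]) for p in items]
    r = pearson(odd, even)
    return 2 * r / (1 + r)
```

Predefining an acceptability threshold (e.g., ≥ .70) before computing this coefficient is exactly the safeguard the QUADAS-II assessment found missing in the two studies above.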
In the domain of the reference standard, 11 out of 29 studies focused exclusively on assessing the test‒retest reliability of the EFAT under investigation, thus not utilizing a reference standard. In these instances, the risk of bias was categorized as not applicable (N/A).
Furthermore, two studies (37,42) lacked sufficient information regarding the use of a reference standard. Ishigami et al. (37) failed to provide details on when and how participants underwent the EF tests and measures that served as the reference standard. Similarly, Kruger et al. (42) did not adequately justify their selection of the Dysexecutive Questionnaire (DEX) as the reference standard for evaluating the validity of their novel Executive Function Scale for Adults (EFSA). Clarifying these aspects would enhance the comprehensiveness and validity of their studies.
Table 2 Risk of bias in the reviewed studies
In the domain of flow and timing, 20 out of 29 studies were assessed as having a low risk of bias, as no delay was observed between the administration of the index test (EFAT) and the reference standard (conventional gold-standard EFAT). Conversely, four studies (18,36,55,56) lacked sufficient information to assess the risk of bias, as they provided inadequate detail on the guidance given to participants during test execution and on the uniformity of procedures across participants, including whether all underwent both the index test and the reference standard. Finally, five studies (32,38,47,50,51) were deemed to have a high risk of bias due to inconsistencies in the testing procedures administered to participants.
Concerning applicability, within the domain of the index test (the investigated EFAT), all studies were evaluated as having a low risk of bias. In contrast, regarding participant selection, six studies lacked sufficient information for bias assessment. Finally, within the domain of the reference standard, no studies were determined to exhibit a high level of bias. Moreover, 11 studies were deemed not applicable for bias assessment because they did not utilize reference standards, whereas the remaining studies demonstrated a low level of bias.
Finally, during the comprehensive article review stage of study selection, rigorous quality control (considering the robustness of each study's methods and results, as well as the presentation of their limitations) was employed by both independent reviewers, resulting in a significant reduction in the original number of records (see Figure 1). Therefore, the analysis of the included studies' quality (Table 2 and Figure 3) demonstrated a relatively acceptable level of bias.
Results of the reviewed studies
The 29 studies reviewed in this analysis examined the psychometric properties of various EFATs for assessing EF, including both existing and newly developed measures. The respective EF components evaluated by each tool are outlined in Table 3a, while Table 3b presents an overview of the investigated psychometric properties.
Table 3a Assessed Components for each EFAT
Study | EFAT | Assessed EF components
Kruger et al. (42) | Executive Function Scale for Adults (EFSA) | Inhibition, working memory, and cognitive flexibility
Scarfo et al. (13) | Anderson's pediatric model | Attentional control, cognitive flexibility, information processing, and goal setting
Wang et al. (30) | “Fisherman” serious game | Inhibition, shifting, and working memory
Arioli et al. (40) | Remote Characterization Module (RCM) | Verbal memory, language fluency, working memory span, and set shifting
Karlsen et al. (43) | Cambridge Neuropsychological Test Automated Battery (CANTAB) | Visual learning and memory, executive function, and visual attention
Ott et al. (44) | National Institute of Health Cognitive Battery (NIH-CB) | Cognitive flexibility, attention and executive function, episodic memory, working memory, processing speed, language, motor dexterity, and fluid composite
Park & Schott (31) | Digital version of the Trail Making Test (dTMT) | Perceptual speed, visual sequencing ability, mental flexibility, visual working memory, and inhibition
Wahyuningrum et al. (18) | Indonesian Neuropsychological Test Battery (INTB) | Working memory, cognitive flexibility, divided attention, inhibition, verbal fluency, short-term visual-spatial memory, and sustained attention
Heled et al. (45) | “Tactual Span” | Working memory
Pires et al. (46) | Various EF tests (Set 1) | Inhibition, planning, working memory, divided attention, verbal fluency, episodic memory, confrontation naming, and processing speed
White et al. (47) | Various EF tests (Set 2) | Inhibition (Anti), response switching (Pro/Anti), selective attention (Simon and Flanker), and working memory (2-back)
Cotrena et al. (48) | Melbourne Decision Making Questionnaire (MDMQ) | Decision making (vigilance, hypervigilance, buck-passing, and procrastination)
Feenstra et al. (49) | Amsterdam Cognition Scale | Attention, information processing, learning and memory, executive functioning, and psychomotor speed
Parsons & Barnett (39) | Virtual apartment-based Stroop test | Inhibition, cognitive flexibility, and attentional control
Rijnen et al. (50) | Central Nervous System Vital Signs | Verbal memory, visual memory, processing speed, psychomotor speed, reaction time, complex attention, and cognitive flexibility
Soveri et al. (51) | Various EF tests (Set 3) | Inhibition, working memory, and set shifting (cognitive flexibility)
Ishigami et al. (37) | Attention Network Test-Interaction (ANT-I) | Attentional control (alerting, orienting, and executive control networks)
Kaller et al. (38) | Tower of London Freiburg version (TOL-F) | Planning ability
Cardoso et al. (52) | Hotel Task (HT) | Planning, organization, self-monitoring, and cognitive flexibility
Godoy et al. (53) | Barkley Deficits in Executive Functioning Scale (BDEFS) | Self-management of time, self-organization/problem solving, self-restraint, self-motivation, and self-regulation of emotion
Köstering et al. (54) | Tower of London Freiburg version (TOL-F) | Planning ability
Malloy-Diniz et al. (55) | Brazilian version of the Barratt Impulsiveness Scale (BIS-11BV) | Attention, inhibition, motor and nonplanning impulsiveness
Heaton et al. (56) | National Institute of Health Cognitive Battery (NIH-CB) | Cognitive flexibility, attention and executive function, episodic memory, working memory, processing speed, language, motor dexterity, and fluid composite
Troyer et al. (32) | Online EF tests tool (Set 4) | Working memory, inhibition, cognitive flexibility, and processing speed
Aalbers et al. (57) | Brain Aging Monitor-Cognitive Assessment Battery (BAN-COG) | Working memory (“Conveyer Belt”), visuospatial short-term memory (“Sunshine”), and episodic recognition memory
Brunner et al. (58) | P3 No-Go wave | Inhibition (P3 NoGo Early) and monitoring of actions (P3 NoGo Late)
Kang et al. (41) | Kaplan version of the Stroop test | Inhibition, selective attention, and processing speed
Beato et al. (59) | Brazilian version of the Frontal Assessment Battery (FABBV) | Conceptualization, mental flexibility, motor programming, sensitivity to interference, inhibitory control, and environmental autonomy
Dubois et al. (36) | Frontal Assessment Battery (FAB) | Conceptualization, mental flexibility, motor programming, sensitivity to interference, inhibitory control, and environmental autonomy
Note. EFAT: Executive Function Assessment Tools; Set 1: Working Memory, Tower, Divided Attention, Stroop, Verbal Fluency, Word List, Confrontation Naming, Coding and Telephone tests; Set 2: Pro, Anti and Pro/Anti, Simon, Flanker, Forward and Backward Spatial, and 2-back tests; Set 3: Simon, Visuoverbal and Visuospatial N-back, Letter-memory and the Number–letter tasks; Set 4: Spatial Working Memory, Stroop Interference, Face-Name Association, and Number-Letter Alternation tasks.
Table 3b Assessment of Psychometric Properties in the Reviewed Studies
Study | EFAT | Validity | Reliability | FA | Norms
Kruger et al. (42) | Executive Function Scale for Adults (EFSA) | Crit. | IC | CFA, EFA | No
Scarfo et al. (13) | Anderson's pediatric model | Predictive | No | No | No
Wang et al. (30) | “Fisherman” serious game | Crit. | IC, S-H | CFA, EFA | No
Arioli et al. (40) | Remote Characterization Module (RCM) | Constr. | Intermeth | No | No
Karlsen et al. (43) | Cambridge Neuropsychological Test Automated Battery (CANTAB) | No | T-Rt | No | No
Ott et al. (44) | National Institute of Health Cognitive Battery (NIH-CB) | Constr. | No | No | No
Park & Schott (31) | Digital version of the Trail Making Test (dTMT) | Constr. | IC | No | No
Wahyuningrum et al. (18) | Indonesian Neuropsychological Test Battery (INTB) | No | IC, T-Rt | PCA | Age-Edu related
Heled et al. (45) | “Tactual Span” | Constr. | IC | EFA | No
Pires et al. (46) | Various EF tests (Set 1) | No | No | CFA | No
White et al. (47) | Various EF tests (Set 2) | No | T-Rt | No | No
Cotrena et al. (48) | Melbourne Decision Making Questionnaire (MDMQ) | Constr., Crit. | IC | CFA | No
Feenstra et al. (49) | Amsterdam Cognition Scale | No | T-Rt | No | Equating
Parsons & Barnett (39) | Virtual apartment-based Stroop test | No | No | No | No
Rijnen et al. (50) | Central Nervous System Vital Signs | No | T-Rt | No | No
Soveri et al. (51) | Various EF tests (Set 3) | No | T-Rt | No | No
Ishigami et al. (37) | Attention Network Test-Interaction (ANT-I) | Constr., Crit. | S-H | No | No
Kaller et al. (38) | Tower of London Freiburg version (TOL-F) | No | S-H, T-Rt | No | No
Cardoso et al. (52) | Hotel Task (HT) | No | No | No | No
Godoy et al. (53) | Barkley Deficits in Executive Functioning Scale (BDEFS) | Constr., Crit. | IC | No | No
Köstering et al. (54) | Tower of London Freiburg version (TOL-F) | No | T-Rt | No | No
Malloy-Diniz et al. (55) | Brazilian version of the Barratt Impulsiveness Scale (BIS-11BV) | No | IC | No | Percentiles
Heaton et al. (56) | National Institute of Health Cognitive Battery (NIH-CB) | Constr., Crit. | IC, T-Rt | No | No
Troyer et al. (32) | Online EF tests tool (Set 4) | Constr. | IC, S-H, T-Rt | PCA | Age-related
Aalbers et al. (57) | Brain Aging Monitor-Cognitive Assessment Battery (BAN-COG) | Constr. | IC, AFR | No | No
Brunner et al. (58) | P3 No-Go wave | No | T-Rt | No | No
Kang et al. (41) | Kaplan version of the Stroop test | Constr. | S-H | No | Age-related
Beato et al. (59) | Brazilian version of the Frontal Assessment Battery (FABBV) | No | No | No | Percentiles
Dubois et al. (36) | Frontal Assessment Battery (FAB) | Constr., Crit. | Interrater | No | No
Note. EFAT: Executive Function Assessment Tools; FA: factor analysis; Crit.: criterion; IC: internal consistency; CFA: confirmatory factor analysis; EFA: exploratory factor analysis; S-H: split-half; Constr.: construct; Intermeth: intermethod; T-Rt: test-retest; PCA: principal component analysis; Age-Edu: age and education; AFR: alternate form reliability; Set 1: Working Memory, Tower, Divided Attention, Stroop, Verbal Fluency, Word List, Confrontation Naming, Coding and Telephone tests; Set 2: Pro, Anti and Pro/Anti, Simon, Flanker, Forward and Backward Spatial, and 2-back tests; Set 3: Simon, Visuoverbal and Visuospatial N-back, Letter-memory and the Number–letter tasks; Set 4: Spatial Working Memory, Stroop Interference, Face-Name Association, and Number-Letter Alternation tasks.
Notably, two studies (39,52) diverge from this focus. Parsons and Barnett (39) investigated a virtual reality adaptation of the Stroop test (referred to as the virtual apartment-based Stroop), comparing its performance with both the Automated Neuropsychological Assessment Metrics (ANAM) Stroop test and the paper-and-pencil (P&P) color-word interference test from the Delis-Kaplan Executive Function System battery (D-KEFS). Conversely, Cardoso et al. (52) conducted only a transcultural adaptation of the Hotel Task into Brazilian Portuguese. Despite not specifically examining psychometric properties, these studies were included in the current review because of their relevance to its objectives.
Regarding the frequency of tests administered in the studies, a total of 199 administrations were conducted in the 29 reviewed studies, covering a total of 11,386 healthy adult participants. Most of the tests employed to assess the validity of the target instruments belong to established neuropsychological batteries, including the Wechsler scales (the Wechsler Adult Intelligence Scale and the Wechsler Memory Scale), the Delis-Kaplan Executive Function System (D-KEFS), and the Cambridge Neuropsychological Test Automated Battery (CANTAB), among others.
In the following sections, the most frequently assessed EF components and associated EFATs, as well as the batteries of EFATs employed in the reviewed studies, are presented. The main findings from the 29 studies are then organized into two categories for clarity: novel EFATs (13 studies) and existing EFATs (16 studies). In the first category, the new tools (or batteries) are introduced, followed by the results of their psychometric analyses. In the second, only the results of the psychometric analyses are presented, since the existing EFATs are well known to the scientific community. Finally, considerations about result bias are presented for each study.
The most commonly assessed EF components and associated EFATs in the reviewed studies
Table 4 highlights the most frequently employed tests and the EF components most often targeted in the reviewed studies. The EFATs investigated focused on assessing the three primary components of EF: inhibition (21 studies), working memory (18 studies), and cognitive flexibility (16 studies). The associated tests most frequently employed were the Stroop test (in several versions), the Digit Span test (mostly from the Wechsler scales; 60), and the Trail Making Test.
Table 4 Frequency of Research on EF Components and Associated EFATs in the Reviewed Studies
EF component | No. of studies (% of total) | Most employed EFATs (No. of administrations)
Inhibition | 20 (69.0%) | Stroop (10), TMT-B (4), Go/NoGo (3)
Working memory | 18 (62.1%) | WAIS Digit Span (11), WAIS Word List (7), WAIS Number-Letter (6)
Cognitive flexibility | 16 (55.2%) | TMT (8), WAIS Number-Letter (6), Stroop (2), WCST (2)
Processing speed | 15 (51.7%) | Stroop (10), WAIS Digit Span (11), WAIS Digit Symbol Coding (5)
Attentional control | 13 (44.8%) | WAIS Digit Symbol Coding (5)
Verbal fluency | 12 (41.4%) | Verbal Fluency (12)
Planning | 10 (34.5%) | Tower of London (3)
Episodic memory | 8 (27.6%) | WAIS Word List (7)
Psychomotor abilities | 4 (20.7%) | FAB Motor Series (2)
Note. EFAT: Executive function assessment tool; most studies administered more than one test for the same component. EF: Executive Functions; TMT-B: Trail Making Test Part B; TMT: Trail Making Test; WCST: Wisconsin Cards Sorting Test; FAB: Frontal Assessment Battery; WAIS: Wechsler Adult Intelligence Scale.
Some studies examined the interplay of these components with more complex executive processes, such as planning and problem-solving (18,38,49,52,54,57,58), decision-making (18,41,48), and abstract reasoning (13,37). Interestingly, the study by Scarfo et al. (13) examined the influence of EF components on goal setting, a construct that comprises initiative, abstract reasoning, planning, and strategic organization.
Furthermore, several batteries were administered in the reviewed studies, as described in the next section.
Batteries of EFAT employed in the reviewed studies
In the reviewed studies, researchers utilized tests (or subscales) from various standardized batteries, demonstrating the significance of these batteries for EF assessment in neurotypical adults. Table 5 illustrates the results. The spreadsheet containing detailed data from all tests conducted in the reviewed studies is available in the supplementary material.
The most frequently employed test battery in the reviewed studies was the Wechsler Adult Intelligence Scales (WAIS), which was used in eight studies (13,30,40,41,45,46,56,57). The second and third most frequently used batteries were tied, with both cited in four studies: the Wechsler Memory Scales (WMS; 41,44,46,57) and the Delis–Kaplan Executive Function System (D-KEFS; 37,39,46,56).
Table 5 Frequency of test batteries used in the reviewed studies
Test battery | Frequency (no. of studies)
Wechsler Adult Intelligence Scales (WAIS-R, WAIS-III, WAIS-IV, and WASI) | 8
Wechsler Memory Scales (WMS-R and WMS-III) | 4
Delis-Kaplan Executive Function System (D-KEFS) | 4
Behavioral Assessment of Dysexecutive Syndrome measures (BADS) | 2
Cambridge Neuropsychological Test Automated Battery (CANTAB) | 2
Frontal Assessment Battery (FAB) | 2
National Institutes of Health Toolbox Cognition Battery (NIHTB-CB) | 2
Amsterdam Cognition Scan (ACS) | 1
Army Individual Test Battery | 1
Automated Neuropsychological Assessment Metrics (ANAM) | 1
Brain Aging Monitor-Cognitive Assessment Battery (BAM-COG) | 1
Central Nervous System Vital Signs (CNS VS) | 1
Cognitive Test Battery (CTB) | 1
FISHERMAN Serious Game | 1
Indonesian Neuropsychological Test Battery (INTB) | 1
Psycholinguistic Assessment of Language battery (PAL 09) | 1
Test of Everyday Attention battery (TEA) | 1
Note. WAIS-R, WAIS-III, and WAIS-IV are different versions of the Wechsler Adult Intelligence Scale; WASI: Wechsler Abbreviated Scale of Intelligence; WMS-R and WMS-III are different versions of the Wechsler Memory Scale.
Other batteries, including the Behavioral Assessment of Dysexecutive Syndrome (BADS), the Cambridge Neuropsychological Test Automated Battery (CANTAB), the Frontal Assessment Battery (FAB), and the National Institutes of Health Toolbox Cognition Battery (NIHTB-CB), were each employed in two studies. The remaining batteries, some of which are newly developed (described in the section below), were used in only one study each.
Novel EFAT
Among the 29 studies, 13 presented new EFATs, two of which focused on different psychometric analyses of the same tool. These new tools are briefly described below. Table 6 shows the results of their investigated psychometric properties.
Table 6 Psychometric Properties of Studies on Novel EFATs
Study
|
Tool
|
Psychometric properties
|
Kruger et al. (42)
|
EFSA
|
Reliability: acceptable to excellent internal consistency for two components, WM and IC (Cronbach’s α = .90 and .79, respectively), and low for CF (α = .62.).
Validity: significant correlation with the DEX self-report scale, with strong correlation for WM, moderate for IC and weak for CF (ρ = .71, .59, and .17, respectively).
CFA/EFA: moderate correlation between factors, with strong correlation between WM e IC, and weak correlation between WM e CF and IC e CF as well.
|
Wang et al. (30)
|
Fisherman
|
Reliability: revealed good internal consistency of the Cautions, Agile, and Wise subgames (Cronbach’s α = .83, .89, and .80, respectively). The split-half reliability of all subgames was acceptable to good (.83, .88, and .77, respectively).
Validity: all three subgames showed significant moderate correlations: 1) Cautions Fisherman with stop signal (r = .40), and Trail Making Test Part A (TMT– A; r = .38); 2) Agile Fisherman with Number switch task (r =.51), and TMT-A (r = .40); and 3) Wiseman Fisherman with Corsi-Block-Tapping task (r = .75), and Digit backward span (r =.41). Among the subgames, the Cautions Fisherman and the Agile Fisherman were modestly correlated (r = .45).
|
Arioli et al. (40)
|
RCM
|
Reliability: intermethod revealed significant correlations (p<.05) for verbal memory, language fluency, and set shifting (r = .45 to .61, indicating low reliability), but not for working memory span (p = .19 to .21).
Validity: no significant differences between RCM and P&P in four tasks (Short- and Long-Delay Free Recall, Lexical fluency, and Modified Trail Making B); significant differences for the other four tasks (Total Immediate Recall, Semantic Fluency, and Verbal Digit Span Forward and Backward), being modest for Total Immediate Recall (d = .35), whereas the remaining tasks exhibited robust differences (d = 1.28, 1.19, and .53, respectively).
|
Park and Schott (31)
|
dTMT
|
Reliability: good to excellent (ICC between .90 e .95) for all three conditions TMT-M, TMT-A, and TMT-B.
Validity: showed good agreement (ρ = .82 to .90) and equivalence as well between the P&P and digital TMT, since no significant difference was found for both, the difference (B-A) e ratio (B/A).
|
Wahyuningrum et al. (18)
|
INTB
|
Reliability: low to excellent reliability (ICC = .44 to .91), PCA revealed a well-fitting seven-factors: Factor 1 - speed of visuospatial information processing and planning, Factor 2 - attention, auditory short-term, and working memory, Factor 3 - executive internal language, Factor 4 - visual cued semantic processing, Factor 5 speed and inhibitory control, Factor 6 - recall ability, and Factor 7 - learning ability.
Normative: age demonstrated a significant effect (p > .01) on six out of the seven factors, except for factor 3 (p = .754); education showed significant influence on five factors, but not on factors 6 and 7.
|
Heled et al. (45)
|
Tactual Span
|
Reliability: Cronbach’s α for correct scores in all Tactual Span trials was .77 for the forward stage and .80 for the backward stage, indicating acceptable reliability.
Validity: low to moderate convergent validity (ρ = .19 to .47) and good discriminant validity (no significant correlations between Tactual Span and selective attention and semantic fluency tasks).
EFA: revealed two factors for the forward stage (F1: visual and auditory spans and F2: visuospatial and actual spans) and just one factor for the backward stage comprising all span task. This is aligned with the Baddeley’s model (8).
|
Feenstra et al. (49)
|
ACS
|
Reliability: low to acceptable test-retest reliability (ICC = .45 to .80 and .83 for the ACS total score).
Multiregression analysis: significant effects of age, gender, and education on total ACS scores, with standardized β coefficients of -.597, .111, and -.112, respectively.
|
Parsons and Barnet (39)
|
VABS
|
Convergent validity: the test proved to be valid for evaluating cognitive interference control, with much more challenging than in traditional P&P tests without distractors.
|
Ishigami et al. (37)
|
ANT-I
|
Reliability: all three networks was significant (p<,05): alerting (r = .29), orienting (r = .70), and executive control (r = .68), but indicating low to acceptable reliability.
Construct validity: each network score was significant (p < .01) and independent from each other.
Criterion validity: executive network was a significant predictor of performance on conflict resolution and verbal memory and retrieval (β = −.165 and −.184, p’s < .05, respectively).
|
Kaller et al. (38)
|
TOL-F
|
Reliability: acceptable reliability for all estimates (λ2, λ3, λ4, ωtot, and glb) in both samples, Mainz (.713 - .755) and Vienna (.656 - .730). The estimate glb converged for both samples (.755 for Mainz and .730 for Vienna), as did the estimate λ4 (.743 for Mainz e .716 for Vienna).
|
Köstering et al. (54)
|
TOL-F
|
Reliability: planning accuracy test-retest for relative consistency (r = .734 to .739) as well as absolute agreement (r = .690) were acceptable; planning latencies, however, showed low reliability for both, relative consistency, and absolute agreement (r = .274 to .519).
|
Troyer et al. (32)
|
Online EF tests tool
(Set 4)
|
Feasibility: 87% of participants successfully completed the tool, with 94% of completed tests yielding results within the anticipated range.
Reliability: split-half reliability was low (0.62); high internal consistency (Cronbach’s α = .96); low to moderate test-retest reliability (r = .49 to .83 for individual tasks, and r = .72 for overall score); and finally, low to acceptable alternate version reliability (r = .48 to .82 for individual tasks, and r = .69 for overall score).
Validity: for construct validity, the correlations between age and the target measures were low to moderate, with r(394) = -.20 to .31; for convergent validity, the intertask correlations of the target measures were low to moderate, with r(394) = .27 to .30.
PCA: conservatively identified a single-factor solution, but there was also evidence for a two-factor solution in the context of speeded responding.
|
Aalbers et al. (57)
|
BAM-COG
|
Reliability: AFR was low to moderate (ICC = .420, .426, and .645 for the Conveyer Belt, Sunshine, and Papyrinth games, respectively) but poor for the Viewpoint game (ICC = .167).
Convergent construct validity: moderate (ρ = .400 to .669) for all subgames except Viewpoint.
Divergent construct validity: good for all subgames (ρ < .200 with unrelated cognitive measure).
|
Note. EFSA = Executive Function Scale for Adults; WM = Working Memory; IC = Inhibitory Control (or inhibition); CF = Cognitive Flexibility; DEX = Dysexecutive Questionnaire; ρ = Spearman correlation coefficient; CFA = Confirmatory Factor Analysis; EFA = Exploratory Factor Analysis; r = Pearson correlation coefficient; TMT-A = Trail Making Test Part A; RCM = Remote Characterization Module; P&P = Paper-and-Pencil; d = Cohen’s effect size coefficient; dTMT = Digital Trail Making Test; TMT-M = Trail Making Test Motor Speed; TMT-B = Trail Making Test Part B; INTB = Indonesian Neuropsychological Test Battery; ICC = Intraclass Correlation Coefficient; PCA = Principal Component Analysis; ACS = Amsterdam Cognition Scan; β = Standardized Regression Coefficient; VABS = Virtual Apartment-Based Stroop; ANT-I = Attention Network Test-Interaction; TOL-F = Tower of London Freiburg version; λ2, λ3, λ4, glb, and ωtot = reliability indicators used to assess internal consistency; Set 4 = Spatial Working Memory, Stroop Interference, Face-Name Association, and Number-Letter Alternation tasks; BAM-COG = Brain Aging Monitor-Cognitive Assessment Battery; AFR = Alternate Form Reliability.
The Executive Function Scale for Adults (EFSA) is a 27-item scale that assesses three EF components: working memory (WM), inhibitory control (IC), and cognitive flexibility (CF). Kruger et al. (42) investigated preliminary evidence of the validity of the EFSA, including convergent validity based on the relationship of the scale with external variables. Although the EFSA is, to the best of the authors’ knowledge, the first scale developed exclusively for this population in the country, its online administration to a convenience sample may have introduced bias into the results.
The Fisherman game is a serious game consisting of three subgames: Cautious Fisherman, Agile Fisherman, and Wise Fisherman, which measure inhibition, shifting, and working memory, respectively. Wang et al. (30) investigated the reliability, construct validity, and criterion validity of the Fisherman and its subgames. The Fisherman shows great promise as a self-administered assessment tool and has demonstrated high reliability. However, the absence of an explanation of how outliers were treated in the results poses a potential risk of bias.
The remote characterization module (RCM) is a digital cognitive screening tool devised by Arioli et al. (40). It comprises an application featuring eight tasks designed to assess verbal memory, language fluency, working memory span, and set shifting. The authors assessed RCM construct validity and intermethod reliability. They acknowledged that the lack of participants with lower levels of education limits the test's reliability. Additionally, the assumption of a normal distribution in the scores, without justification, raises concerns about potential bias in the findings, particularly given the small sample size.
The digital TMT (dTMT) is a computerized version of the Trail Making Test (TMT), a commonly used neuropsychological test for the assessment of visuomotor ability and mental flexibility, in addition to working memory and inhibition (61). In their study, Park and Schott (31) evaluated and compared the psychometric properties (reliability, equivalence, and agreement) of the P&P version of the TMT and the dTMT under three conditions: TMT-M (motor speed), TMT-A (Part A), and TMT-B (Part B). Although the dTMT displayed strong psychometric properties on the basis of thorough statistical analysis, the small sample size may have introduced bias into the results.
The Indonesian Neuropsychological Test Battery (INTB) consists of ten P&P tests targeting executive functions; reception and production of language; various types of learning and memory, both verbal and visuospatial; and attention and concentration. Wahyuningrum et al. (18) investigated the test-retest reliability of the INTB and performed a principal component analysis (PCA) to identify its cognitive construct factor structure and the respective effects of age and education. While this study yields promising results for the INTB, it overlooks how practice effects could have influenced the test-retest reliability outcomes.
The Tactual Span, a new WM test modality, measures working memory capacity; unlike traditional WM tests, however, it considers a third storage modality beyond the verbal and visuospatial storage proposed in (7,8). Heled et al. (45) investigated the reliability and construct validity of the Tactual Span. They also performed exploratory factor analysis (EFA) to examine whether Baddeley’s working memory (WM) model aligns with the multimodal span tests employed in the study. The results indicated that the Tactual Span is a promising modality for assessing WM. However, it is important to note a potential risk of bias in the findings, given the homogeneous nature of the sample (composed mostly of undergraduate students).
The Amsterdam Cognition Scan (ACS) is an online self-administered neuropsychological test battery developed by Feenstra et al. (49) that measures attention, information processing, learning and memory, executive functioning, and psychomotor speed. They investigated the test-retest reliability of the ACS and established regression-based, demographically corrected normative data. The findings demonstrated moderate to high reliability. However, nonuniform testing conditions among the participants could have introduced bias: for example, 40 of the 235 participants were not alone during the test, which may have influenced the results.
The Virtual Apartment-Based Stroop (VABS) test is a VR variant of the Stroop test, in which the Stroop stimulus is presented on a television inside a virtual apartment (containing a kitchen and a window) with various auditory and visual distractors. These distractors are positioned in different fields of view in the environment, such as a school bus passing by on the street and visible from the window, a robot vacuum moving and generating noise on the floor, and a phone ringing on the table in front of the participant. Parsons and Barnet (39) evaluated the performance of the VABS compared with that of the ANAM and D-KEFS Stroop tests in older and younger adults. Since the ANAM and D-KEFS Stroop tests do not include environmental distractors, there is a potential risk of bias in comparing scores on interference inhibition tests administered under different conditions.
The attention network test-interaction (ANT-I) is a computerized variant of the attention network test (ANT; 62) proposed by Callejas et al. (63). The ANT target measures are the three attention networks (alerting, orienting, and executive control) of the theoretical model by Petersen and Posner (64). Ishigami et al. (37) investigated the construct and criterion validity and split-half reliability of the ANT-I. The authors did not include RT in the regression analyses of criterion validity, noting that this variable could strengthen the identified relationships. However, they could have used mediation/moderation analysis to determine the extent of RT's influence.
The Tower of London Freiburg version (TOL-F) represents a computerized variant of the classic Tower of London task (TOL; 5), where participants engage with digital representations of colored balls on a computer screen to solve planning problems. Köstering et al. (54) examined the test-retest reliability of the TOL-F, employing three Iso-versions that maintained structural consistency while permuting ball colors to ensure novelty across repeated assessments. One potential source of bias in the findings of this study is that the authors relied on a relatively small and homogeneous sample consisting primarily of university students. Furthermore, Kaller et al. (38) provided insights into the test-retest and split-half reliability, as well as construct validity estimates of the TOL-F, drawing from large samples in Mainz, Germany, and Vienna, Austria.
The online EF test tool is a self-administered internet-based screening tool tailored for middle-aged and elderly individuals. Devised by Troyer et al. (32), the tool consists of four tests: spatial working memory, Stroop interference, face-name association, and number-letter alternation tasks. The authors investigated its feasibility, reliability, and construct validity and presented normative data. They aimed to create a self-assessment tool capable of informing an individual's decision about whether to seek healthcare services to evaluate potential memory deficits. They also recognized the risk of bias in their findings, given the myriad factors that can influence outcomes in an uncontrolled environment.
Finally, the Brain Aging Monitor-Cognitive Assessment Battery (BAM-COG) is an online tool consisting of four puzzle games: Conveyer Belt, Sunshine, Viewpoint, and Papyrinth, which assess WM, visuospatial short-term memory, episodic recognition memory, and planning ability, respectively. Aalbers et al. (57) examined the reliability and construct validity of the BAM-COG. The authors thoroughly addressed the risk of bias in their discussion section.
Preexisting EFAT
Among the 16 studies that focused on neuropsychological tests commonly used in the assessment of executive functions, three addressed cross-cultural adaptation of the Hotel Task (HT), the Melbourne Decision-Making Questionnaire (MDMQ), and the Barkley Deficits in Executive Functioning Scale (BDEFS) in Brazilian Portuguese (48,52,53). These studies are briefly described below.
Compared with the original version, the Brazilian-adapted version of the HT underwent several modifications to accommodate the cultural nuances of the Brazilian population. In both healthy adults and patients with traumatic brain injury, the adapted version exhibited no floor or ceiling effects and demonstrated significant score variability among participants. These findings suggest the preliminary suitability of the adapted HT for use within the Brazilian population, encompassing both healthy individuals and those with neurological conditions. Nonetheless, the authors emphasize the need for future research to establish evidence of reliability, construct and criterion validity, as well as sensitivity and specificity (52).
With respect to the Brazilian adaptation of the MDMQ, of the 22 statements in the original version, four items (2, 4, 8, and 14) were excluded because of low interjudge agreement (between 50% and 75%), and item 5 was replaced with a new item in the hypervigilance subscale. A confirmatory factor analysis (CFA) revealed a similar factor structure between the adapted and original MDMQ versions. Additionally, acceptable overall internal consistency (Cronbach’s α = .824) was found, with α = .857, .853, .664, and .791 for the vigilance, buck-passing, hypervigilance, and procrastination subscales, respectively (48).
The Brazilian adapted version of the BDEFS showed semantic correspondence and satisfactory agreement with the original version, with 85 of the 89 items showing moderate to good correlation (53). Moreover, the five subscales were strongly correlated (ρ > .7) between the two versions. Additionally, the adapted version showed good internal consistency (Cronbach’s α = .961) and marginally moderate construct validity (ρ > .30) with the Brazilian versions of the Barratt Impulsiveness Scale (BIS-11) and the Adult Self-Report Scale (ASRS-18).
The results of the remaining 13 studies that focused on the psychometric analysis of existing EFATs are shown in Table 7. Among these studies, six investigated batteries frequently used in serial assessment (NIHTB-CB, CNS VS, CANTAB, and FAB); four studies investigated sets of numerous EF tests (Set 2: Pro, Anti and Pro/Anti, Simon, Flanker, Forward and Backward Spatial, and 2-back tests; and Set 3: Simon, Visuoverbal and Visuospatial N-back, Letter-memory and Number–letter tasks); and three studies investigated commonly used EF tests (BIS, NoGo, and Stroop). Additionally, many reviewed studies tested gold standard subtests of other well-known assessment batteries, such as the Wechsler scales, D-KEFS (three studies), and BADS (see Table 5), which serve as reference standards to support the psychometric analysis of their investigated EFATs.
Table 7 Psychometric Properties of Studies on Existing EFATs
Study
|
Tool
|
Psychometric properties
|
Scarfo et al. (13)
|
Andersons’s pediatric model.
|
Predictive validity: good-fitting models for AC (χ2(2) = 1.61, p = .447, RMSEA = 0.000, CFI = 1.000); CF (χ2(8) = 2.90, p = .940, RMSEA = 0.000, CFI = 1.000); IP (χ2(4) = 1.15, p = .886, RMSEA = 0.000, CFI = 1.000); and GS (χ2(8) = 7.22, p = .513, RMSEA = 0.000, CFI = 1.000).
|
Karlsen et al. (43)
|
CANTAB.
|
Test-retest reliability: low to moderate for most tests (r = .39 to .79). Acceptable test-retest reliability was observed only for the SWM between-errors (r = .71) and strategy (r = .79) scores, AST percent correct (r = .75), and RVP (r = .75).
|
Ott et al. (44)
|
NIHTB-CB.
|
Construct validity: poor-to-adequate for the attention and executive function, episodic memory, and processing speed domains (ICC = -.029 to .517); poor-to-good for the working memory and motor dexterity domains and the fluid composite (ICC = .215 to .801); adequate-to-good for language (ICC = .408 to .829).
|
Pires et al. (46)
|
EF tests (Set 1)
|
CFA: good-fitting three-correlated-factor model with EF, VA, and PS factors (CFI = .992, RMSEA = .018, 90% CI [.000, .088], SRMR = .065). PS was more strongly related to EF (r = .64) than to VA (r = .41), and EF and VA showed no correlation with each other (r = .06).
|
White et al. (47)
|
EF tests (Set 2)
|
Test-retest reliability: low to excellent for RT (ICC = .34 to .93) in inhibition (Anti), response switching (Pro/Anti), selective attention (Simon and Flanker), and working memory (2-back) tasks.
Lower reliability was observed between the first two time points (within the first testing day), indicating practice effects.
|
Rijnen et al. (50)
|
CNS VS.
|
Test-retest reliability: low to good for consistency (ICC = .40 to .89) and agreement (ICC = .17 to .88). Correlation coefficients: psychomotor speed (.88), processing speed (.81), RT (.78), CF (.74), complex attention (.55), verbal memory (.43), and visual memory (.41).
|
Soveri et al. (51)
|
EF tests (Set 3)
|
Test-retest reliability (RT and accuracy): 1) RT: low reliability for the Simon test (r = .611 to .636), low to acceptable for the Visuoverbal N-back (r = .647 to .854), Visuospatial N-back (r = .701 to .823), Letter-Memory (r = .827 to .847), and Number-Letter tests (r = .428 to .773); 2) accuracy: low reliability for the Visuoverbal N-back (r = -.221 to .687) and acceptable for the Letter-Memory test (r = .827 to .847).
|
Malloy-Diniz et al. (55)
|
BIS-11BV.
|
Reliability of factor structure: the overall score (one-factor solution) presented acceptable internal consistency (Cronbach’s α = .790), with subscales ranging between .147 and .789; the two-factor solution (inhibition and nonplanning) presented the best internal consistency (Cronbach’s α = .789 and .618, respectively), considering a cutoff value ≥ .600.
Normative: no significant impact of age, sex, or education.
|
Heaton et al. (56)
|
NIHTB-CB.
|
Test-retest reliability: acceptable to good internal consistency (Cronbach’s α = .84, .83, and .77) and strong to very strong test-retest correlations (r = .92, .86, and .90) for the Crystallized, Fluid, and Total Cognition Composite scores, respectively.
Convergent construct validity: good correlations between NIHTB-CB and Gold Standard Measures on the Crystallized (r = .90), Fluid (r = .78), and Total (r = .89) Cognition Composite scores.
Discriminant construct validity: moderate correlation between the NIHTB-CB Crystallized and Gold Standard Fluid Cognition Composite scores (r = .39), and low between the NIHTB-CB Fluid and Gold Standard Crystallized Cognition Composite scores (r = .19). The correlation between the NIHTB-CB Crystallized and Fluid Cognition Composite scores was also low (r = .17).
|
Brunner et al. (58)
|
P3 No-Go wave.
|
Test-retest reliability: acceptable for amplitude (ICC > .75), and excellent for latency (ICC > .90).
|
Kang et al. (41)
|
Kaplan Stroop
|
Split-half reliability: excellent for Stroop A (r = .907) and B (r = .911), and acceptable for Stroop C (r = .797).
Construct validity: correlation coefficients between demographic variables and Stroop full-time scores did not differ significantly from similar correlations with half-time scores (Hotelling’s t tests, p > .05).
|
Beato et al. (59)
|
FABBV.
|
Normative: FAB scores significantly correlated with education (r = .47, p < .0001), and MMSE score (r = .39, p < .001). No significant correlations with age or gender (p = .13 and .09, respectively).
|
Dubois et al. (36)
|
FAB.
|
Interrater reliability: good (κ = .87); internal consistency was also good (Cronbach’s α = .78).
Concurrent criterion validity: moderate association between FAB and WCST for perseverative errors (ρ = .68, p < .001), and strong for the number of criteria (ρ = .77, p < .001); strong correlation between FAB and DRS (ρ = .82, p < .001).
Discriminant construct validity: good between healthy participants and patients (analysis of covariance: F[1,131] = 17.24; p < .001).
|
Note. AC = attentional control; χ2 = Pearson's chi-square; RMSEA = Root Mean Square Error of Approximation; CFI = Comparative Fit Index; CF = cognitive flexibility; IP = information processing; GS = goal setting; CANTAB = Cambridge Neuropsychological Test Automated Battery; r = Pearson’s correlation coefficient; SWM = Spatial Working Memory; AST = Attention Switching Task; RVP = Rapid Visual Processing; NIHTB-CB = National Institutes of Health Toolbox Cognition Battery; ICC = Intraclass Correlation Coefficient; EF = Executive Functions; Set 1 = Working Memory, Tower, Divided Attention, Stroop, Verbal Fluency, Word List, Confrontation Naming, Coding, and Telephone tests; CFA = Confirmatory Factor Analysis; VA = Verbal Abilities; PS = Processing Speed; CI = Confidence Interval; SRMR = Standardized Root Mean Square Residual; Set 2 = Pro, Anti and Pro/Anti, Simon, Flanker, Forward and Backward Spatial, and 2-back tests; CNS VS = Central Nervous System Vital Signs; RT = Reaction Time; Set 3 = Simon, Visuoverbal and Visuospatial N-back, Letter-memory and Number–letter tasks; BIS-11BV = Brazilian version of the Barratt Impulsiveness Scale; FABBV = Brazilian version of the Frontal Assessment Battery; MMSE = Mini-Mental State Examination; FAB = Frontal Assessment Battery; WCST = Wisconsin Card Sorting Test; ρ = Spearman’s correlation coefficient; DRS = Dementia Rating Scale.
With respect to the psychometric properties investigated in the reviewed studies, reliability (23 studies) was the most commonly evaluated, followed by validity (15 studies), factor analysis (7 studies), and normative data (6 studies). Test-retest reliability and construct validity were the most commonly employed methods for reliability and validity analysis, respectively.
One critical aspect of psychometric analysis lies in its noteworthy susceptibility to bias (65). One frequently observed source of bias in the reviewed studies was the sampling method: most studies used convenience samples. Other equally relevant aspects were also observed: homogeneity of the samples in terms of participants' educational level, and little mention of the assumptions required for parametric analysis (for example, some studies chose to use Pearson correlation with small samples, when Spearman’s correlation might have been more appropriate).
Furthermore, in the context of the methods used to assess reliability in the reviewed studies, biases may also have arisen from a) an exaggerated preference for Cronbach's alpha as a measure of internal consistency without considering other, more appropriate estimates, such as λ4 and ωtot (especially when different factor loadings are present), and b) the use of Pearson's or Spearman's correlation when intraclass correlation might have been more appropriate (because it accounts for systematic error), among other practices posing a high risk of bias that must be carefully addressed (66).
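As a concrete illustration of point (a), Cronbach's alpha is easily computed from an item-score matrix, but the formula weights every item identically, which is exactly the tau-equivalence (equal factor loadings) assumption that motivates alternatives such as λ4 and ωtot. The sketch below uses hypothetical item scores invented for illustration (Python, standard library only):

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score).
    `items` is a list of k item-score lists, one list per item."""
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total score
    return k / (k - 1) * (1 - item_vars / pvariance(totals))

# Hypothetical scores of 5 respondents on 3 items (made-up data).
items = [
    [2, 4, 4, 5, 3],
    [3, 4, 5, 5, 2],
    [2, 5, 4, 4, 3],
]
print(round(cronbach_alpha(items), 3))  # prints 0.886 for these made-up scores
```

When loadings differ across items, alpha can understate reliability; estimates such as λ4 and ωtot, which do not impose equal loadings, are then generally preferable.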
Finally, it is important to underscore the diverse approaches employed by the authors of the reviewed studies in establishing thresholds for assessing reliability and validity (67–69). The absence of consensus on interpreting such thresholds, particularly for correlation indices (Pearson and Spearman) and intraclass correlation coefficients (ICCs), complicates efforts toward standardization (65,66).