Participants
In total, the present study included initial (“Time 1”) responses from the parents of \(N\) = 3,413 eligible children aged 0–71 months residing in Nebraska, USA. In addition, both parent-report and observational data came from a 12–24 month follow-up subsample (“Time 2”) of \(N\) = 70 children who were 0–47 months old at Time 1. At Time 1, 53.3% of children were male, 55.9% were identified as white, non-Hispanic, and the mean age was 34.6 months. At Time 2, 42.9% were male, 58.6% were identified as white, non-Hispanic, and the mean age was 37.8 months.
The complete set of inclusion and exclusion criteria for participant eligibility is diagrammed in the study identification flowchart shown in Fig. 1. To be eligible for the present study, participants had to be parents of children aged birth to five years residing in Nebraska. “Parents” were defined as adults who were responsible for the child’s care at least 40 hours a week and identified themselves as biological, foster, or adoptive parents, or other relatives. At Time 1, 97.6% of respondents reported that they were the biological, foster, or adoptive parent of the child; at Time 2, all respondents (100%) identified as the biological, foster, or adoptive parent. Throughout this paper, we refer to respondents as “parents.” Because the present study focuses on the validation of the English version of the KMT, we excluded all responses to the Spanish version. The Time 2 sample was identified from the respondents to the Time 1 survey based on their willingness to be contacted again. All Time 2 sample participants completed the Time 1 survey before May 2021.
Studies have noted that financial incentives (like those we offered our participants) increase the likelihood of receiving fraudulent responses (cf. Lawor et al., 2021). As part of a screening protocol, we excluded observations if (a) metadata resulted in a “likely fraudulent” score from the rIP package (Waggoner et al., 2019) and the IP Hub database (https://iphub.info/), (b) the caregiver failed to accurately confirm the child’s birthdate, or (c) scores on the CREDI or ECDI (see below for a description of each) fell more than 5 SD above or below the mean. We refer to all initial administrations that met the eligibility criteria as the Time 1 sample (N = 3,413).
The duration between the initial administration at Time 1 and the follow-up at Time 2 averaged M = 16 months (range = 12–24 months). We excluded from analysis any observations that did not meet basal requirements for the direct assessments administered at Time 2. We refer to eligible responses at follow-up as the Time 2 sample (N = 70).
Procedures
Data for Time 1 responses were collected using an online survey between October 2020 and February 2023. Participants were given the option to take the survey in English or in Spanish, and we offered parents a gift card ($20 to $40) for completing the survey. We recruited parents through healthcare providers, childcare and parenting support programs, and social media posts. We gave parents a link to an online questionnaire that included several questions on family demographics and the child’s adverse childhood experiences (ACEs; described below), development, health, and home environment. Respondents could complete the survey on a mobile phone, tablet, or computer, and completion took 20 to 30 minutes.
Measures
Kidsights Measurement Tool
The Kidsights Measurement Tool was constructed by first forming a candidate item bank that adopted items from four previously validated instruments, each measuring normative aspects of children’s development (i.e., skills or behaviors that children acquire or exhibit as they age when undergoing healthy development) between 0–5 years. These four instruments included (1) the Global Scale of Early Development Short Form (GSED-SF; McCray et al., 2023), (2) the Caregiver Reported Early Development Instruments Long Form (CREDI-LF; McCoy et al., 2018), (3) the Early Childhood Development Index (ECDI2030; Cappa et al., 2021), and (4) Healthy and Ready to Learn (HRTL; Ghandour, 2019). We included only items that measured normative aspects of children’s development and excluded items from these instruments that measure constructs such as problem behaviors or other indicators of psychosocial difficulties.
This process resulted in a candidate item bank of 223 items, with 79 items unique to the GSED-SF, 23 items unique to the CREDI-LF, 7 items unique to the ECDI2030, and 49 items unique to the HRTL. Of the 223 items, 49 items were shared across two or more of the four contributing instruments (42 items were common between the GSED-SF and CREDI-LF; 7 items were common between the GSED-SF, CREDI-LF, and ECDI2030).
The 223 candidate items measure motor, cognition, language, and/or social/emotional constructs according to the published literature and existing documentation for the four instruments. Specifically, 71 items represented fine or gross motor development constructs (cf. McCray et al., 2023; redacted, 2021; Cappa et al., 2021) or physical development (cf. Ghandour, 2019). Additionally, 82 items represented cognitive or language development (cf. McCray et al., 2023; redacted, 2021; Cappa et al., 2021) or early learning skills (cf. Ghandour, 2019). Lastly, 70 items measured social/emotional development (McCray et al., 2023; redacted, 2021; Cappa et al., 2021; Ghandour, 2019), including normative aspects of children’s self-regulation (Ghandour, 2019).
The 223 candidate items were then screened for sufficient variability in responses; items with an endorsement probability of at least 90% at birth were flagged for removal. This process led to 19 items being removed from the candidate pool (see Supplemental Table 1). The result was a final set of 204 items spanning development from birth to age 5 years.
Concurrent and Predictive Measures
We administered previously validated direct-observation measures at Time 2 and parent-reported instruments at both Time 1 and Time 2.
Direct-observation instruments. Two direct assessments were administered: (a) the Bayley Scales of Infant and Toddler Development, Fourth Edition (Bayley-4; Bayley & Aylward, 2019) and (b) the Woodcock-Johnson IV Early Cognitive and Academic Development (WJ IV ECAD; Schrank et al., 2018). The Bayley-4 and the WJ IV ECAD were administered only to the follow-up subsample at Time 2 (N = 70). The WJ IV ECAD was used for children 43–60 months of age at Time 2 (n = 33), and the Bayley-4 was used for children up to 42 months at Time 2 (n = 37).
Bayley-4. The Bayley-4 is validated to measure child development up to 42 months (e.g., Klein-Radukic et al., 2023). The instrument is divided into items that capture development in the cognitive, language, motor, social/emotional, and adaptive behavior domains through direct administration of activities, observation of the child, and questions to the caregiver (Bayley & Aylward, 2019). Scores are provided at the domain level and the subtest level.
For the Bayley-4 training, assessors were required to complete a 12-hour online training hosted by the measure’s publisher. After completing the training, assessors submitted video recordings of their administrations of the Bayley-4 or scheduled in-person observations with the research team’s Bayley-4 supervisor. The supervisor has several years of experience administering the Bayley Scales of Infant Development and evaluated each assessor for correct item administration and scoring in order to certify each assessor as reliable on the Bayley-4.
WJ IV ECAD. The WJ IV ECAD is considered a measure of intellectual ability, academic skills, and language, specifically oral expression, for children 30 months to 6 years old (Schrank et al., 2018). Administration of the WJ IV ECAD yields a General Intellectual Ability score, an Expressive Language score, and scores for each test (LaForte et al., 2015). Although there are 10 tests in the WJ IV ECAD, only 7 of the tests were administered for the study. Similarly, only the Bayley-4 cognitive, language, and motor scales were administered to children up to 42 months.
Assessors followed a similar training and reliability process for the WJ IV ECAD as for the Bayley-4. They reviewed the WJ IV ECAD kit materials and submitted video recordings of their administrations of the WJ IV ECAD. The research team’s WJ IV ECAD supervisor has experience administering the measure and reviewed the recordings for correct administration and scoring before certifying the assessors as reliable on the measure.
Caregiver-reported instruments. The candidate Kidsights item pool included all items from the GSED-SF, CREDI-LF, ECDI2030, and HRTL. As a result, in administering the KMT, we effectively administered these four caregiver-reported instruments concurrently. Scores from the GSED-SF are termed “D-scores” (Weber et al., 2019) and were calculated using the dscore R package (van Buuren, Eekhout, & Huizing, 2022). The CREDI Long Form yields an overall score of child development as well as subscale scores for motor, cognition, language, and social/emotional development (Seiden et al., 2021; citation redacted). We calculated CREDI scores using the credi R package (https://github.com/marcus-waldman/credi). ECDI2030 scores were calculated using UNICEF’s (2023) provided R syntax file. HRTL scores were calculated by replicating the four-factor solution reported in Ghandour (2019) using the lavaan R package (Rosseel, 2012) and extracting factor scores. The four factors include a physical/motor factor, an early learning factor, a social/emotional factor, and a self-regulation factor.
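For transparency, this scoring workflow can be sketched in R. The sketch below assumes hypothetical column names for the survey data; in particular, the HRTL item names and factor assignments shown are placeholders rather than the actual four-factor specification from Ghandour (2019).

```r
# Minimal sketch of the caregiver-instrument scoring; the data layout and
# item names are assumptions for illustration.
library(dscore)  # D-scores for the GSED-SF items
library(lavaan)  # CFA factor scores for the HRTL items

# D-scores: dscore() locates the GSED item columns and needs the child's age;
# the "agedays" column name is an assumed part of the data layout.
dsc <- dscore(data = df, xname = "agedays", xunit = "days")

# HRTL: a four-factor CFA with ordinal items; the item-to-factor assignments
# below are hypothetical placeholders.
hrtl_model <- '
  physical =~ hrtl01 + hrtl02 + hrtl03
  learning =~ hrtl04 + hrtl05 + hrtl06
  socemo   =~ hrtl07 + hrtl08 + hrtl09
  selfreg  =~ hrtl10 + hrtl11 + hrtl12
'
hrtl_fit    <- cfa(hrtl_model, data = df, ordered = TRUE)
hrtl_scores <- lavPredict(hrtl_fit)  # one column of factor scores per factor
```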
Global Scales of Early Development Psychosocial Form. In addition to the KMT, we administered the Global Scales of Early Development Psychosocial Form (GSED-PF; citation redacted). The GSED-PF is currently undergoing validation and measures manifestations of early psychosocial stress. The GSED-PF includes an overall score of children’s behaviors, as reflected through internalizing behaviors, externalizing behaviors, feeding problems, sleeping difficulties, and social competency problems.
Family- and Caregiver-Level Criterion Measures
Socioeconomic Measures. Using questions taken from the National Survey of Children’s Health (NSCH), we asked parents to report their enrollment in governmental services and programs, educational attainment, and household income. Parents were asked to report whether they were enrolled in 1) Medicaid; 2) cash assistance from a government welfare program; 3) free or reduced-cost breakfasts or lunches at school; 4) Food Stamps or Supplemental Nutrition Assistance Program (SNAP) benefits; or 5) benefits from the Women, Infants, and Children (WIC) Program. For analysis, we created a single dummy variable (denoted GOVT) indicating whether the caregiver was enrolled in any of the above governmental programs (1-Yes, 0-No).
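As a minimal sketch, assuming the five enrollment questions are stored as 0/1 columns under the hypothetical names below, the GOVT indicator reduces to a row sum:

```r
# GOVT = 1 if the caregiver reported enrollment in any listed program;
# the program column names are hypothetical.
progs <- c("medicaid", "cash_assist", "frpl", "snap", "wic")
df$GOVT <- as.integer(rowSums(df[, progs], na.rm = TRUE) > 0)
```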
In the 2020 survey, we asked parents to report their 2019 household income in United States Dollars (USD); likewise, in the 2022 survey, we asked parents to report their 2021 household income. To place household income on the same scale across years, we adjusted for inflation by converting to 1999 USD.
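A minimal sketch of this conversion follows; the annual CPI-U values shown are illustrative, and the `income`/`income_year` columns are assumptions about the data layout rather than reported study variables.

```r
# Convert reported household income to 1999 USD using annual CPI values
# (illustrative figures; substitute the price index actually used).
cpi <- c("1999" = 166.6, "2019" = 255.7, "2021" = 271.0)
df$income_1999usd <- df$income * cpi["1999"] / cpi[as.character(df$income_year)]
```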
Parents could select from nine options in reporting their educational attainment: 1) 8th grade or less; 2) 9th–12th grade, no diploma; 3) high school graduate or GED completed; 4) completed a vocational, trade, or business school program; 5) some college credit, but no degree; 6) Associate degree (AA, AS); 7) Bachelor’s degree (BA, BS, AB); 8) Master’s degree (MA, MS, MSW, MBA); or 9) Doctorate (PhD, EdD) or professional degree (MD, DDS, DVM, JD). In the present study, we collapsed this information into four categories: 1) no high school (HS) diploma; 2) HS diploma; 3) some college or an Associate’s degree (i.e., AA/AS); and 4) Bachelor’s degree (i.e., BA/BS) or higher. We dummy coded caregiver educational attainment using parents with only a high school education as the reference group.
Anxiety and Depressive Symptoms. We administered the two-item Patient Health Questionnaire (PHQ-2; Kroenke et al., 2003) and the two-item Generalized Anxiety Disorder scale (GAD-2; Löwe et al., 2008) to obtain caregiver self-reports of depressive and anxiety symptoms. Parents reported whether, over the last two weeks, they (1) had little interest or pleasure in doing things (i.e., indicator 1 of the PHQ-2), (2) were feeling down, depressed, or hopeless (i.e., indicator 2 of the PHQ-2), (3) were feeling nervous, anxious, or on edge (i.e., indicator 1 of the GAD-2), and (4) were not able to stop or control worrying (i.e., indicator 2 of the GAD-2). Parents responded using a four-point Likert scale (i.e., “0-Not at all”; “1-Several days”; “2-More than half the days”; “3-Nearly every day”). For analysis, we created a depression and anxiety symptom total score by summing all four items.
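Under hypothetical item column names, the total score is a simple row sum:

```r
# Total depressive/anxiety symptom score from the two PHQ-2 and two GAD-2
# items (each scored 0-3), yielding a 0-12 total; column names are hypothetical.
df$DEPANX <- rowSums(df[, c("phq_1", "phq_2", "gad_1", "gad_2")])
```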
Child-Level Criterion Measures
Child-level criterion variables included information on the child’s sex (i.e., male or female), race, ethnicity, overall health status (as reported by the caregiver), and exposure to adverse childhood experiences (ACEs; Felitti et al., 1998). We adopted the survey items from the NSCH (Ghandour et al., 2018) in collecting this information.
Race and Ethnicity. Parents could select from up to 15 racial categories and one of five ethnicity categories. Racial category response options included: 1) American Indian or Alaska Native; 2) Asian Indian; 3) Black or African American; 4) Chinese; 5) Filipino; 6) Guamanian or Chamorro; 7) Japanese; 8) Korean; 9) Native Hawaiian; 10) other Asian; 11) other Pacific Islander; 12) Samoan; 13) Vietnamese; 14) White; or 15) Some other race. Ethnicity response options included 1) No, not of Hispanic, Latino, or Spanish origin; 2) Yes, Mexican, Mexican American, Chicano; 3) Yes, Puerto Rican; 4) Yes, Cuban; 5) Yes, another Hispanic, Latino, or Spanish origin.
For analysis, we combined race and ethnicity into four major categories: 1) White, non-Hispanic; 2) Black or African American, non-Hispanic; 3) Other (including two or more races), non-Hispanic; and 4) Hispanic.
Child’s General Health. We asked parents to rate their child’s general health (“In general, how would you describe this child’s health?”). Response options included: 1) Poor, 2) Fair, 3) Good, 4) Very Good, or 5) Excellent. For analysis, we applied dummy coding to indicate whether the child was reported to be in very good or excellent health (1-Yes, 0-No).
Adverse Childhood Experiences. We administered eight items measuring children’s ACEs, all of which included “0-No” or “1-Yes” response options: 1) caregiver divorce or separation, 2) caregiver death, 3) a household member with a drug or alcohol problem, 4) a caregiver diagnosed with a mental illness, 5) exposure to violence in the community, 6) exposure to domestic violence, 7) parental incarceration, and 8) racism. We determined the child’s ACE count by summing responses to the eight items.
Scaling and Scoring Procedures
We fit the graded-response IRT model in (1)-(2) to the polytomous data,
$$\begin{array}{c}P\left({Y}_{ij}\ge k|{\theta }_{i}\right)=\frac{1}{1+\text{exp}\left\{-{\alpha }_{j}\left({\theta }_{i}-{\delta }_{jk}\right)\right\}}, k=1,\dots ,{K}_{j}-1, \#\left(1\right)\end{array}$$
$$\begin{array}{c}{\theta }_{i}={{z}}_{i}^{{\prime }}{\gamma }+{\zeta }_{i}, {\zeta }_{i}\sim N\left(0,{\sigma }_{\zeta }^{2}\right), \#\left(2\right)\end{array}$$
where \(i\) indexes a child (with a latent score [i.e., ability] of \({\theta }_{i}\)), \(j\) indexes an item (with \({K}_{j}\) response options), and \(k\) indexes one of the response options for the item. Model parameters in (1)-(2) include \({\alpha }_{j}\) (the item discrimination value), \({\delta }_{jk}\) (the difficulty value associated with response option \(k\) for item \(j\)), and the vector of latent regression coefficients \({\gamma }\) associated with the child-level covariates \({{z}}_{i}\). We fit the model using maximum marginal likelihood estimation (also referred to as full-information maximum likelihood); maximum likelihood estimators are a gold-standard approach to treating missing data (Schafer & Graham, 2002). All model fitting occurred using the mirt package (Chalmers, 2012) in R (R Core Team, 2022). We calculated Kidsights scores by summarizing each child’s posterior distribution using the expected-a-posteriori (EAP) point estimate.
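A minimal sketch of this model fitting in mirt follows. The response matrix `resp`, the covariate data frame `covs`, and the use of age as the latent-regression covariate are assumptions for illustration, not the study’s exact specification.

```r
# Sketch of fitting the graded-response model in (1)-(2) with a latent
# regression, then extracting EAP scores and their standard errors.
library(mirt)
fit <- mirt(resp, model = 1, itemtype = "graded",
            covdata = covs, formula = ~ AGE)  # latent regression (assumed covariate)
sc    <- fscores(fit, method = "EAP", full.scores.SE = TRUE)
theta <- sc[, "F1"]     # EAP point estimates (the Kidsights scores)
csem  <- sc[, "SE_F1"]  # posterior SDs, used later as conditional SEMs
```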
Analytic Plan
We followed the Standards for Educational and Psychological Testing (the Standards) in collecting and analyzing validity evidence. Evidence came from analyzing (1) test content, (2) relations with other variables (including criterion variables and scores from concurrent instruments), (3) the sensitivity of conclusions to threats from measurement non-invariance or item misfit, and (4) score precision as indicated by errors of measurement (i.e., reliability).
Evidence Based on Test Content
Using the domain assignments provided by the originating instruments, we assigned each item to one of three domains: (1) Motor/physical development, (2) Cognition or language development, or (3) Social/emotional development. To assess content coverage, we calculated the (average) domain composition of the administered items within yearly age categories. Evidence based on test content was evaluated by two subject matter experts to ensure adequate representation of items by developmental domain.
Evidence Based on Other Variables
In line with the Standards, we collected evidence that Kidsights scores correlate with other variables in the expected magnitude and direction. This evidence included: (a) convergent validity evidence from scores on concurrent instruments that measure constructs equivalent (or highly similar) to those the KMT is directly intended to measure; (b) discriminant validity evidence from scores on concurrent instruments measuring constructs the KMT is not directly intended to measure; and (c) associations of Kidsights scores with exogenous criterion variables known to be predictive of child development.
Convergent Validity Evidence. Convergent validity evidence came from calculating part correlations (i.e., correlations after adjusting for the child’s age) of Kidsights scores with (a) the Bayley-4 Cognition, Receptive and Expressive Communication, and Gross and Fine Motor domain- and subtest-level scores for children 42 months and younger; and (b) the WJ IV ECAD General Intellectual Ability–Early Development scores, Expressive Language cluster scores, as well as the Verbal Analogies, Sentence Repetition, and Rapid Picture Naming test activities for children 43–60 months.
Convergent validity evidence from caregiver-reported instruments came from part correlations with: (a) D-scores, (b) CREDI scores (overall scores, as well as motor, language, cognition, and social/emotional subscores), (c) ECDI2030 scores, and (d) motor/physical, early learning, and social/emotional factor scores from the HRTL.
We are aware of no published ideal or minimum threshold for a correlation to establish convergent validity evidence. Because a correlation of approximately \(r=\).70 represents 50% of variance explained, we used this value as an ideal threshold, and we used \(r=\).50 as a minimum threshold (i.e., 25% of variance explained).
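As a sketch, a part correlation of this kind (used throughout the convergent, discriminant, and predictive analyses) can be computed by residualizing the Kidsights score on age and correlating the residuals with the raw criterion score. Whether age enters linearly or as a polynomial in this step is an assumption here, as are the variable names.

```r
# Age-adjusted part correlation of Kidsights scores with a criterion score;
# assumes complete data on `theta` and `AGE` in the Time 2 data frame `df2`.
kid_resid <- resid(lm(theta ~ poly(AGE, 4), data = df2))
cor(kid_resid, df2$bayley_cog, use = "pairwise.complete.obs")
```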
Discriminant Validity Evidence. Discriminant validity evidence came from part correlations (i.e., correlations after controlling for children’s age) with scores from other instruments that reflect aspects of children’s behaviors which are not directly tied to normative aspects of children’s motor, cognitive, language, or social/emotional development. This included scores from the HRTL self-regulation factor and psychosocial problem scores from the GSED-PF. Because these scores reflect constructs not intended to be directly measured by the KMT, evidence comes from correlations smaller in magnitude (i.e., \(\left|r\right|\)<.50) and in the expected direction (i.e., negative part correlations with psychosocial problem scores and positive part correlations with self-regulation scores).
Predictive Validity Evidence. We assessed predictive validity evidence by studying part correlations of Kidsights scores at Time 1 with Bayley-4 and WJ IV ECAD scores obtained 12–24 months (M = 16 months) later at Time 2. We took positive and statistically significant part correlations as evidence that Kidsights scores predict future development and learning.
Criterion Associations. Following a model-building procedure, we fit six multiple regression models (Models 1–6) of increasing complexity to evaluate criterion associations with variables known to be predictive of early childhood development. We controlled for the child’s age using a fourth-order polynomial so that the regression equations took the general form
$$\begin{array}{c}{\theta }_{i}={\alpha }_{0}+{\alpha }_{1}{\text{A}\text{G}\text{E}}_{i}+{\alpha }_{2}{\text{A}\text{G}\text{E}}_{i}^{2}+{\alpha }_{3}{\text{A}\text{G}\text{E}}_{i}^{3}+{\alpha }_{4}{\text{A}\text{G}\text{E}}_{i}^{4}+{{x}}_{i}^{{\prime }}{\beta }+{ϵ}_{i}, {ϵ}_{i}\sim N\left(0,{\sigma }^{2}\right), \#\left(3\right)\end{array}$$
where \({\theta }_{i}\) is the Kidsights score for the ith child and the criterion variables are elements in \({{x}}_{i}\).
In Model 1, we included a dummy variable indicating the child’s sex as female (FEMALE; 0-Male, 1-Female) and tested whether females demonstrate higher average scores. In Model 2, we augmented Model 1 with an indicator variable for whether the caregiver reported the child was in very good or excellent overall health (HEALTHY; 0 – “Poor”, “Fair”, or “Good”; 1 – “Very Good” or “Excellent”). Building on Model 2, we specified Model 3 with three indicator variables of the caregiver’s educational attainment; we chose attaining a high school diploma as the reference category, resulting in three dummy variables indicating that the caregiver had (a) not attained a high school diploma (NOHS; 0-No, 1-Yes), (b) attended college or attained an Associate’s degree but not a Bachelor’s degree (SOMECOLL; 0-No, 1-Yes), or (c) attained a Bachelor’s degree or higher (denoted BS; 0-No, 1-Yes). We expected increased educational attainment to be associated with higher average scores. In Model 4, we evaluated whether the caregiver’s reported enrollment in governmental services (GOVT; 0 – not enrolled; 1 – enrolled in SNAP, WIC, free or reduced-cost school meals, a cash assistance program, or a governmental healthcare program) was negatively associated with scores. In Model 5, we included information on the caregiver’s race and ethnicity by specifying indicator variables for whether the caregiver was white, non-Hispanic (WHITE; 0-No, 1-Yes), Black, non-Hispanic (BLACK; 0-No, 1-Yes), or Hispanic (HISP; 0-No, 1-Yes); non-Hispanic parents identifying as two or more races or in a racial category other than Black or white served as the reference category. After adjusting for differences in the child’s overall health, the caregiver’s educational attainment, and enrollment in governmental services, we did not expect race and ethnicity to predict scores. Lastly, in Model 6 we included scores measuring the caregiver’s depression and anxiety symptoms (DEPANX) and the child’s adverse experiences (ACEs).
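For concreteness, a minimal sketch of the most complex specification (Model 6) follows, using the dummy variables defined above; the data frame and exact variable names are assumptions, and pooling over imputations is shown in the next sketch.

```r
# Model 6: fourth-order age polynomial plus all criterion variables.
model6_formula <- theta ~ poly(AGE, 4) + FEMALE + HEALTHY +
  NOHS + SOMECOLL + BS + GOVT + WHITE + BLACK + HISP + DEPANX + ACES
mod6 <- lm(model6_formula, data = df)
summary(mod6)
```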
Missingness was most prevalent in the GAD-2/PHQ-2 items (min = 14.9%; max = 15.3%), the survey question on the child’s general health (14.5%), and the child’s ACEs (with 0.50% not responding to at least one of the ACE questions). In response, we employed multiple imputation using the mice (van Buuren et al., 2015) R package to handle missingness on the criterion variables. For all models, we evaluated the evidence from criterion associations by pooling results using Rubin’s rules, conducting pairwise t-tests, and interpreting the magnitude and substantive size of the coefficients. For models with multiple coefficients to be tested (i.e., Models 3, 5, & 6), we conducted simultaneous tests of the coefficients using a multiple-imputation F-test procedure (see, e.g., van Buuren, 2018).
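A sketch of the imputation-and-pooling workflow follows; the number of imputations and the nested comparison shown (jointly testing the education dummies, as in Model 3) are illustrative assumptions.

```r
# Impute criterion variables, pool coefficients via Rubin's rules, and run a
# multiple-imputation F-test (D1) for jointly tested coefficients.
library(mice)
imp  <- mice(df, m = 20, seed = 123)  # m = 20 is an assumed choice
fit3 <- with(imp, lm(theta ~ poly(AGE, 4) + FEMALE + HEALTHY +
                       NOHS + SOMECOLL + BS))
summary(pool(fit3))                   # pooled estimates (Rubin's rules)
fit2 <- with(imp, lm(theta ~ poly(AGE, 4) + FEMALE + HEALTHY))
D1(fit3, fit2)                        # joint test of the education dummies
```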
Sensitivity Analysis of Possible Threats to Valid Inferences
We conducted sensitivity tests to assess whether measurement non-invariance or item misfit may threaten valid inferences of population differences in children’s overall development.
Measurement Non-Invariance. We conducted a sensitivity check to assess whether measurement non-invariance resulting from differential item functioning (DIF) threatens inferences about between-group differences in scores. We highlight here only the essential details of our procedure and refer the reader to the supplemental materials for technical details.
We first screened each item for DIF (uniform and non-uniform) across race, ethnicity, household income, caregiver’s educational attainment, and enrollment in governmental services. For items exhibiting statistically significant DIF, we created a sequence of items ordered from smallest to largest p-value. Iteratively, we removed the next item in the sequence, refit the graded-response model in (1) and (2), extracted new EAP scores, and refit Model 6. We recorded whether the coefficients differed in significance or substantive size compared to the estimates obtained when no items were removed.
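The iterative portion of this check can be sketched as follows, assuming cumulative removal of the flagged items (`dif_items`, ordered by p-value) and reusing `model6_formula` from the earlier regression sketch.

```r
# Iteratively drop DIF-flagged items, rescore children, and refit Model 6.
library(mirt)
coef_log <- vector("list", length(dif_items))
for (m in seq_along(dif_items)) {
  resp_m   <- resp[, setdiff(colnames(resp), dif_items[1:m])]
  fit_m    <- mirt(resp_m, 1, itemtype = "graded")
  df$theta <- fscores(fit_m, method = "EAP")[, 1]
  coef_log[[m]] <- coef(summary(lm(model6_formula, data = df)))
}
```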
Given that the KMT is a population measure, the presence of DIF is only concerning if it leads to different conclusions about group differences. Indeed, it is well established in the IRT literature that the presence of DIF alone is not sufficient to conclude that inferences are invalid (see, e.g., Chalmers, 2014).
Item Misfit. In addition to measurement non-invariance, we assessed whether item misfit resulting from a poorly fitting assumed item response function threatens valid inferences about population differences. We proceeded in three phases. First, using the guidelines provided by Maydeu-Olivares (2014), we identified poorly fitting items as those with root mean square error of approximation (RMSEA) statistics greater than .08. Second, for each identified item, we specified an item response model that employs a third-order monotonic polynomial to relax the traditional linearity assumption. Third, we compared scores from the model with monotonic polynomials specified for the misfitting items to scores from the original model in (1) and (2). Controlling for children’s age, we considered part correlations less than .95 as evidence that conclusions risk being sensitive to item misfit.
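A sketch of the misfit screen and monotonic-polynomial refit in mirt follows; for simplicity the refit shown omits the latent regression, and in mirt the argument `monopoly.k = 1` corresponds to a polynomial of order 2k + 1 = 3.

```r
# Flag items with S-X2 RMSEA > .08 and refit them with a monotonic polynomial.
library(mirt)
fit0 <- mirt(resp, 1, itemtype = "graded")
ifit <- itemfit(fit0, fit_stats = "S_X2")   # per-item S-X2 statistics with RMSEA
bad  <- as.character(ifit$item[ifit$RMSEA.S_X2 > .08])
types  <- ifelse(colnames(resp) %in% bad, "monopoly", "graded")
fit_mp <- mirt(resp, 1, itemtype = types, monopoly.k = 1)  # 3rd-order polynomial
theta_mp <- fscores(fit_mp, method = "EAP")[, 1]           # rescored children
```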
Reliability and Errors of Measurement
We assessed the precision of scores in two ways. First, we calculated the marginal reliability statistic (\({r}_{X{X}^{{\prime }}}\)) proposed by Thissen and Wainer (2001) using the standard errors of the EAP estimates. Although \({r}_{X{X}^{{\prime }}}\) is a valid measure of reliability, as a marginal statistic it overestimates reliability when a child’s score is to be compared only to scores from same-age peers. To evaluate the precision of scores conditional on a child’s age, we fit a generalized additive model for location, scale, and shape (GAMLSS; Rigby & Stasinopoulos, 2005) to estimate age-conditional variances of the EAP scores. Using the within-age variance of EAP scores and the conditional standard error of measurement (CSEM), we then calculated the expected conditional reliability value (\({r}_{X{X}^{{\prime }}|{\text{A}\text{G}\text{E}}_{i}}\)) for each child,
$$\begin{array}{c}{r}_{X{X}^{{\prime }}|{\text{A}\text{G}\text{E}}_{i}}=1-\frac{{\widehat{\text{C}\text{S}\text{E}\text{M}}}_{i}^{2}}{\widehat{\text{V}\text{a}\text{r}}\left\{{\widehat{\theta }}^{\text{E}\text{A}\text{P}}|{\text{A}\text{G}\text{E}}_{i}\right\}}, \#\left(4\right)\end{array}$$
where \({\widehat{\theta }}^{\text{E}\text{A}\text{P}}\) is the EAP score. We evaluated the average of the \({r}_{X{X}^{{\prime }}|{\text{A}\text{G}\text{E}}_{i}}\) values calculated in the previous step, in addition to evaluating the expected reliability \({r}_{X{X}^{{\prime }}|{\text{A}\text{G}\text{E}}_{i}}\) at each age. We used \({r}_{X{X}^{{\prime }}|{\text{A}\text{G}\text{E}}_{i}}=\) .80 as a cutoff for the minimal reliability at each age. Such a reliability value may be below traditional guidelines for individual assessments (i.e., when conclusions are drawn about individuals). However, population measures like the KMT are intended to produce statistics that aggregate across individual scores, thereby washing out measurement error across individual scores.
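A minimal sketch of the computation in (4) follows, assuming the `theta` and `csem` columns from the earlier scoring sketch; the penalized B-spline smoother for age is an assumed choice.

```r
# Estimate the age-conditional variance of EAP scores with GAMLSS, then form
# the conditional reliability in (4) for each child.
library(gamlss)
gfit <- gamlss(theta ~ pb(AGE), sigma.formula = ~ pb(AGE),
               family = NO, data = df)
sigma_hat <- predict(gfit, what = "sigma", type = "response")  # SD of theta | age
df$rel_age <- 1 - df$csem^2 / sigma_hat^2
mean(df$rel_age)  # average conditional reliability across children
```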