The objectives were to determine: 1) construct validity (structural properties, discriminant validity, known groups validity, measurement invariance, differential item functioning by country), 2) reliability (internal consistency and test-retest), 3) responsiveness, 4) acceptability (time to complete) and 5) interpretability.
Design
We applied the COSMIN taxonomy of best practices in tool development and validation[50, 51] and outcome measure guidance in palliative care[41, 52, 53]. An overview of our study methods is shown in Fig. 1; we present methods and findings by phase of study. The overall construct being measured was symptoms and concerns among children and families facing life-limiting or life-threatening conditions.
Setting
We recruited from three clinical sites delivering palliative care: a children’s HIV outpatient service in Uganda, a teaching and referral hospital (one provincial hospital) in Kenya, and a national children’s hospital in South Africa. Sites were selected for alignment with the WHO definition of paediatric palliative care[54] and the ability to recruit at least 30 new patients per month, to allow for timely study completion.
Inclusion/exclusion criteria
Inclusion: children aged between birth and 17 years receiving care for a life-limiting condition (LLC), with a parent or legal guardian present to consent to study participation.
Exclusion: We excluded children who were deemed by their clinician to be too ill to participate, or whose cognitive impairment was of a severity that precluded meaningful participation. Families were excluded if the child’s clinician felt the family member was too unwell or distressed to take part. We defined a family caregiver as a family member who took care of the child at least 50% of the time.
Recruitment and consent
Carers were approached by a member of the clinical team and informed of the study objectives and procedures. All participating children aged 8 years and above gave assent, and adult caregivers gave informed written consent for children to participate.
Data collection
Data collection occurred between February and November 2012. Study instruments, information sheets and consent forms were forward and back translated from English[55] into local languages: Uganda (Luganda, Runyakore-Rukiga); Kenya (Kikuyu, Luo, Kiswahili); South Africa (Xhosa, Zulu, Pedi Sesotho, South Sotho). This was followed by reconciliation of the two forward translations through dialogue and consultation with content experts.
Because respondent literacy varied, the research nurses read the questions aloud in all instances and recorded responses from children or carers. We trained all research nurses to administer the study instruments and gave each nurse a copy of the standard operating procedures for administering the study tools. Patients could score the C-POS using a hand or verbal scale: in earlier phases of developing the C-POS, children identified hand and verbal scales as their preferred response formats, and self-reporting children could interpret both. Children aged 7 years and above were allowed to respond on their own where possible, given evidence that they can self-report on health and wellbeing[56, 57]. For test-retest, the C-POS was re-administered to a sub-cohort of in-patients whose well-being the clinical team did not expect to change significantly within 24 hours.
Measures
The C-POS
The C-POS addresses young people’s symptoms and concerns, drawing on the “total pain” construct that drives assessment and intervention in palliative care (i.e. physical, psycho-social, spiritual, practical and emotional concerns, and needs of the family) (Appendix Fig. 2). Items are scored on Likert scales from 0 to 5. Questions 1–7 of the C-POS are directed at patients/children; these can be answered by children themselves (self-report) or by proxies acting as observer informants. The tool is self-reported for children aged 7 years and above[58]; where literacy is a problem, staff can read the items and response options aloud, with respondents choosing the best response option themselves. Proxies can respond where a child is unable to because of age (e.g. children aged 0–6 years) or advanced disease, although research with adults has demonstrated that family members’ and professionals’ ratings may differ from the patient’s[59]. For assessment, the C-POS was administered four times, and at each time point the completion time was recorded.
Paediatric Quality of Life Inventory (PedsQL)
The PedsQL[60] is a 23-item measure of quality of life designed to measure the core dimensions of health and role (school) functioning. Given that the PedsQL is a generic, function-based measure of quality of life, we selected it to test convergence with the C-POS measure of palliative care-related symptoms and concerns in children with life-limiting and life-threatening conditions.
Eastern Cooperative Oncology Group Scale (ECOG)
This single item measure of functional performance[61] ranges from 0 (fully active) to 4 (completely disabled). It is used as a proxy for disease progression and its effect on daily living abilities [61]. In line with guidance, we used ECOG with children aged 5 and above [62].
Socio-demographics
A study-specific questionnaire collected data on: child’s age, sex, first language, carer’s relationship to the child, household size, primary diagnosis, phase of illness, place of care (inpatient/outpatient), and reason for referral to palliative care.
Sample size
Sample size estimation was based on factor analysis and structural equation modelling (SEM), as these required the largest sample. Ideally, 10 cases/observations per indicator variable are recommended for factor analysis[63], while sample sizes of 100–150 are recommended for SEM[64]. Based on Monte Carlo simulations with a power of 80% and a p value of 0.05, a sample size of > 200 is recommended for robust weighted least squares or maximum likelihood estimation with either binary or ordinal data[65, 66]. We therefore deemed a sample size > 200 for the child and proxy versions of the C-POS adequate for modelling.
Data management
Data were entered into a pre-designed Epidata database and exported to Stata version 15 and Mplus 8.3 for analysis. The C-POS scores were checked for out-of-range scores and missing values, and items 1–4 and 8 were reverse-scored so that in all instances a score of “0” represented the least severity and “5” the most severe.
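The reverse-scoring step can be sketched as follows (a minimal illustration with made-up values; the column names and data are hypothetical, not the study dataset):

```python
import pandas as pd

# Hypothetical C-POS item scores on the 0-5 Likert scale.
df = pd.DataFrame({
    "item1": [0, 2, 5],
    "item2": [1, 4, 3],
    "item8": [5, 0, 2],
})

# Reverse-score items 1-4 and 8 so that 0 = least severe and 5 = most severe.
REVERSED_ITEMS = ["item1", "item2", "item8"]
df[REVERSED_ITEMS] = 5 - df[REVERSED_ITEMS]

print(df["item1"].tolist())  # [5, 3, 0]
```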
Analysis
1) Construct validity
We assessed for the following aspects of construct validity: a) structural properties; b) discriminant validity; c) known groups validity; d) measurement invariance; e) differential item functioning by country.
1a) Structural properties: theoretically the C-POS has two sub-scales, i.e. child items (n = 7) and family items (n = 5). We conducted multi-level confirmatory factor analysis using weighted least squares estimation to confirm this factor structure[51]. This approach is recommended where a pre-existing theory exists, as confirmatory factor analysis tests a hypothesis and is hence more robust[51]. We compared competing models to identify the model with the best fit using the following fit indices: i) the chi-square statistic with its associated degrees of freedom and P value; ii) the comparative fit index (CFI), for which > 0.9 is recommended; iii) the Tucker-Lewis Index (TLI), ≥ 0.9–0.95; and iv) the RMSEA, for which the recommended cut-off is ≤ 0.05, although < 0.08 is sometimes acceptable[67].
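The fit-index cut-offs above can be collected into a small screening helper (a sketch only; the function name and the treatment of the borderline RMSEA band are our assumptions, not part of the study protocol):

```python
def acceptable_fit(cfi: float, tli: float, rmsea: float) -> bool:
    """Screen a fitted model against the cut-offs described above:
    CFI > 0.90, TLI >= 0.90, and RMSEA <= 0.05 (values below 0.08
    are treated here as borderline-acceptable)."""
    return cfi > 0.90 and tli >= 0.90 and rmsea < 0.08

# e.g. a model with CFI = 0.95, TLI = 0.93, RMSEA = 0.04 passes:
print(acceptable_fit(0.95, 0.93, 0.04))  # True
```

In practice these indices would come from the Mplus output for each competing model; the helper only encodes the decision rule.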
1b) Discriminant validity: We hypothesised that we would find a low-to-moderate correlation between the C-POS and PedsQL (i.e. < 0.6). The following were calculated: PedsQL psychosocial subscale scores vs C-POS psychosocial items (felt happy + felt like playing), and PedsQL physical health subscale vs the C-POS physical items (pain + other symptoms items). We used the Spearman rank correlation (Spearman’s rho) to assess the strength of correlation between the C-POS and PedsQL scores[68] (< 0.3 low correlation; 0.3–0.5 moderate correlation; > 0.5 strong correlation)[69].
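This step can be sketched with scipy (the paired scores below are invented for illustration; real inputs would be the subscale totals described above — note the scales run in opposite directions, so a negative rho is expected):

```python
from scipy.stats import spearmanr

# Hypothetical paired subscale totals. Higher C-POS = more severe problems;
# higher PedsQL = better quality of life.
cpos_physical = [2, 5, 3, 8, 6, 4]
pedsql_physical = [70, 40, 65, 30, 45, 60]

rho, p_value = spearmanr(cpos_physical, pedsql_physical)
print(round(rho, 2))  # -0.94
```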
1c) Known groups validity: We defined our known groups based on functional performance scores, following the principle that certain specified groups of patients are anticipated to score differently. We divided the study population into known groups by functional performance status (for children aged 5 and above, for whom it was possible to use the ECOG). We hypothesised that children with poor physical function would report more palliative care-related problems. For the self and proxy versions of the C-POS, we used analysis of variance (ANOVA) to assess the statistical significance of mean differences in the C-POS child item total scores across the levels of functional performance as measured by the ECOG.
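A one-way ANOVA across ECOG groups can be sketched as below (all scores are hypothetical; in the study the groups would be patients at each observed ECOG level and the outcome the C-POS child item total):

```python
from scipy.stats import f_oneway

# Hypothetical C-POS child-item total scores grouped by ECOG level.
ecog_0 = [5, 7, 6, 4, 8]    # fully active
ecog_1 = [9, 11, 10, 8, 12]
ecog_2 = [14, 13, 16, 15, 12]

f_stat, p_value = f_oneway(ecog_0, ecog_1, ecog_2)
# A small p-value would support the known-groups hypothesis that worse
# functional status is associated with higher (more severe) C-POS scores.
```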
1d) Measurement invariance: We fitted a configural model to explore the extent to which the C-POS self and proxy versions measure a similar construct; then a metric invariance model to determine equivalence between the two versions (self and proxy report) at the factor loading (structural) level; and then a scalar invariance model at the threshold level. At each stage, the model fit indices were examined to identify problematic items (i.e. those with large residual errors) and the level at which variance occurs. Sensitivity analyses were conducted to explore the effect on overall model fit of excluding such items or allowing them to vary.
1e) Differential item functioning by country setting: We used differential item functioning analysis to assess for cultural differences in the functioning of the C-POS items across Uganda, Kenya and South Africa. This was achieved using Multiple Indicator, Multiple Cause (MIMIC) modelling, controlling for the effect of age. We fitted three models: Model 1, the self-report version of the C-POS child items; Model 2, the proxy-report version of the C-POS child items; and Model 3, the family items.
2) Reliability
2a) Internal consistency
As the C-POS is a multidimensional measure, we did not assess the internal consistency of the tool as a whole, as results can be misleading[70]. Following confirmation of the hypothesised factor structure, internal consistency for the child and family subscales was assessed using the omega composite reliability coefficient. This statistic is robust to violation of the equal factor loadings assumption in a factorial model[71]. For items tapping a single construct, coefficients above 0.7 are acceptable, while for multi-dimensional measures coefficients as low as 0.5 are acceptable[72].
Further analysis was undertaken using item response theory to test the precision of the selected items and to identify areas for improvement along the latent construct continuum. We fitted a partial graded model and examined the internal consistency, with additional information on the extent to which the various items contributed to our understanding of the variation in the latent construct. We inspected item and test information functions graphically to reflect how reliably the individual items and the test estimate the construct over the entire scale range. Information values can be converted into an estimate of reliability (using the formula reliability = 1 − [1/information]), extrapolating from Cronbach’s alpha rule of thumb of 0.70 to 0.90 for interpreting reliability; on the test information function curve, these values correspond to acceptable information scores of 3.3 to 10[63].
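The information-to-reliability conversion described above is a one-line transformation; as a check, it maps the 3.3–10 information band onto the familiar 0.70–0.90 reliability range:

```python
def information_to_reliability(information: float) -> float:
    # reliability = 1 - (1 / information)
    return 1.0 - 1.0 / information

print(round(information_to_reliability(3.33), 2))  # 0.70
print(round(information_to_reliability(10.0), 2))  # 0.90
```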
To identify potential areas for improving internal consistency, we also examined the functionality of the items; good items should contribute to our understanding of the variation in the latent construct between − 2 and + 2 standard deviations. Items with poor discrimination power (i.e., those with low information curves) could also be targeted for removal, replacement or rephrasing.
2b) Test-retest reliability
We used the weighted kappa coefficient to assess the level of agreement between the scores at the two time points. More subjective items will generally show relatively low reliability, while physical outcomes may be more consistent[72, 73]. Lower test-retest coefficients were expected for the child self-report version, as test-retest reliability is affected by child developmental age[74, 75]. Coefficients were interpreted as: less than 0.2, poor agreement; 0.21–0.40, slight; 0.41–0.60, moderate; 0.61–0.80, good; and 0.81–1.00, very high[75]. We adopted a score of 0.3 as the lowest acceptable level of agreement.
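A weighted kappa on ordinal 0–5 scores can be sketched with scikit-learn (scores below are hypothetical; the choice of linear weights here is our assumption for illustration, as the study does not specify the weighting scheme):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical item scores (0-5 Likert) for the same in-patients at the
# two time points, 24 hours apart.
time1 = [0, 1, 3, 4, 5, 2, 2, 1]
time2 = [0, 2, 3, 4, 4, 2, 1, 1]

# Linear weights penalise disagreements in proportion to their distance
# on the ordinal scale.
kappa = cohen_kappa_score(time1, time2, weights="linear")
```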
3) Responsiveness
Responsiveness is defined as the ability of a measure to detect change[76]. A good instrument should be able to respond to changes in a patient’s condition, although responsiveness contributes more to our understanding of the variance of the population being assessed and is not a property of the instrument in the strict sense[77]. Using the Wilcoxon signed-rank test, we compared differences in paired scores at the following time points: 1 vs 3, 2 vs 3, 3 vs 4 and 1 vs 4. The initial test-retest measures were 24 hours apart, after which time points were a mean of three days apart. The test-retest data were excluded from the responsiveness analysis. We also used generalised linear regression models to assess change in total C-POS scores for the child (self and proxy versions) and family subscales over time, controlling for age and country setting. The latter approach is more robust as it uses all observations, as opposed to only the selected pairs[78].
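One paired comparison (e.g. time point 1 vs time point 3) can be sketched as below; the totals are invented, and a fall in scores over time would be consistent with problems improving under palliative care:

```python
from scipy.stats import wilcoxon

# Hypothetical paired C-POS total scores at time point 1 and time point 3.
t1 = [12, 15, 9, 18, 14, 11, 16, 13]
t3 = [8, 11, 9, 13, 10, 9, 12, 10]

# Wilcoxon signed-rank test on the paired differences; zero differences
# are dropped by default.
stat, p_value = wilcoxon(t1, t3)
```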
4) Acceptability
We measured both time to complete and extent of completion as indicators of measure acceptability[41]. For feasible clinical use in palliative care populations, measures should be short[79]. Although missing values commonly range from 4–18%, a threshold of 8% is considered acceptable[80].
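Per-item missingness against the 8% threshold can be computed as below (a sketch with hypothetical responses; column names are illustrative):

```python
import pandas as pd

# Hypothetical item responses; None marks a missing value.
df = pd.DataFrame({
    "item1": [3, None, 2, 4, 1, 0, 5, 2, 3, 1],
    "item2": [1, 2, 0, 3, 2, 4, 1, 0, 2, 3],
})

# Percentage of missing values per item, compared against the 8% threshold.
pct_missing = df.isna().mean() * 100
print(pct_missing["item1"])  # 10.0 -> above the 8% threshold
```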
5) Interpretability
Studies have consistently shown that there is little variability in the standard deviations derived from between-subject differences at baseline, change scores, or differences between changes in scores[81]. We therefore used 0.5 of the standard deviation of baseline scores to compute the minimum important difference for the child and family total scores[82]. For the two versions of the C-POS, we computed total C-POS scores for the child and family items. To assess cross-cultural validity, we fitted three multiple indicator, multiple cause models (self-report child items; proxy child items; and the family subscale), controlling for age, to explore differential item functioning by country setting. We set a stringent significance threshold of P < 0.001, considering Bonferroni corrections for multiple testing, and required coefficients of at least 0.64. Items showing differential item functioning would be retained in the scale to avoid inflating the type II error.
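The 0.5-SD minimum important difference reduces to a single calculation on the baseline distribution (the baseline totals below are hypothetical; the sample standard deviation is used here, which is our assumption):

```python
import numpy as np

# Hypothetical baseline C-POS child-item total scores.
baseline = np.array([12, 8, 15, 10, 9, 14, 11, 13])

# Minimum important difference = 0.5 x baseline standard deviation.
mid = 0.5 * baseline.std(ddof=1)
print(round(mid, 2))  # 1.22
```

A between-group or within-patient change in total score smaller than `mid` would then be treated as below the minimum important difference.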