To develop the short-form scale, we extracted data for a secondary analysis from a 20-center prospective cohort study (Uterine fibroids multicenter network information system: www.hifuctr.com). The study included patients who underwent self-selected hysterectomy, myomectomy, or high-intensity focused ultrasound (HIFU) therapy after being fully informed of the treatment options, and it was approved by a China-registered clinical trial ethics committee (ChiECRCT-2011034).[17] Details regarding the study design, data collection, and primary efficacy outcomes have been published.[17] Prior to undergoing any study-related procedures at the clinical site, the patients completed the UFS-QoL questionnaire, the Medical Outcomes Study Short-Form 36 (SF-36), and a brief sociodemographic questionnaire. Follow-up visits were scheduled at 6 and 12 months after treatment. At each visit, complications, magnetic resonance imaging evaluation, overall treatment effect evaluation, the UFS-QoL questionnaire, the SF-36, and several health care utilization items were recorded. The UFS-QoL questionnaire was completed only by patients who had undergone HIFU or myomectomy, because its instructions are based on the presence of uterine fibroids and menstrual periods.
Patient-reported Outcome Measures
Uterine fibroid symptom and quality of life questionnaire (UFS-QoL)
The UFS-QoL was developed from focus groups of women with uterine fibroids.[6, 11] It consists of 37 items: 8 assess symptom severity (a single domain) and 29 assess health-related quality of life (HRQL) across six subscales (concern, activities, energy/mood, control, self-consciousness, and sexual function). All items are rated on a five-point Likert scale. A higher score on the symptom severity subscale indicates more severe symptoms, whereas a lower score on the HRQL subscales indicates poorer QoL.
Medical Outcomes Study Short-Form 36 (SF-36)
The Chinese SF-36 is a 36-item self-administered generic measure used to assess general health status. Its cross-cultural applicability, reliability, and validity have been established.[18-21] The SF-36 consists of eight subscales (physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional, and mental health) and two composite scores (physical and mental). Individual subscale items were combined to form a subscale rating, which was then converted to a 0–100 scale.[22, 23] Higher scores indicate better QoL, and the recall period is four weeks.[23]
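The conversion of a raw subscale rating to a 0–100 scale can be illustrated with a minimal sketch. It assumes the usual linear transformation (raw score minus the lowest possible score, divided by the possible score range, times 100); the function name `to_0_100` and the example score range are illustrative and are not taken from the study.

```python
def to_0_100(raw, lowest, highest):
    """Linearly rescale a raw subscale score to the 0-100 range.

    `lowest` and `highest` are the minimum and maximum raw scores that the
    subscale can produce (assumed known from the scoring manual).
    """
    if highest <= lowest:
        raise ValueError("highest must be greater than lowest")
    return (raw - lowest) / (highest - lowest) * 100.0


# Hypothetical subscale with 10 items, each scored 1 to 3 (raw range 10 to 30).
example_scores = [to_0_100(raw, 10, 30) for raw in (10, 20, 30)]
```

The same transformation places subscales with different numbers of items and response options on a common metric, which makes scores comparable across subscales.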
Item selection
We selected two items per UFS-QoL scale using a method that maximizes content validity.[13, 24] Because the self-consciousness subscale showed poor internal consistency and poor ability to detect change in our previous study,[12] we did not retain its three items. In addition, because the sexual function subscale showed poor adaptability and its two items were highly correlated, we kept only one of the two items. For each of the five remaining subscales, we used classical test theory (CTT) and item response theory (IRT) to select two items. Where the two methods selected different items, we used responsiveness to choose the items with the greater ability to detect change after clinical therapy. We then tested the factor structure, criterion-related validity, and responsiveness of the resulting 11-item version (UFS-QoL-11), following the recommendations for short-form scale development. A flow chart of item selection is shown in Figure 1.
For the CTT-based selection, we followed a regression-based method that maximizes the content validity of a two-item scale. The first item selected was the one with the highest correlation with the parent scale. The second item selected was the one with the highest beta weight among the remaining items in a multiple regression with the parent scale score as the dependent variable and the individual items as predictor variables.
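The two-step CTT selection can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study (which was run in SAS): the parent scale score is taken to be the plain sum of the items, the regression includes all items of the subscale, beta weights are standardized, and the response data and all function names are hypothetical.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def select_two_items(columns):
    """Return (first, second, betas) for one subscale.

    `columns` holds one list of responses per item. The parent score is
    assumed to be the unweighted sum of the items.
    """
    total = [sum(vals) for vals in zip(*columns)]
    item_total = [pearson(col, total) for col in columns]
    # Step 1: item with the highest correlation with the parent scale.
    first = max(range(len(columns)), key=lambda j: item_total[j])
    # Step 2: standardized betas solve R * beta = r, where R is the item
    # intercorrelation matrix and r the item-total correlations.
    R = [[pearson(ci, cj) for cj in columns] for ci in columns]
    betas = solve(R, item_total)
    remaining = [j for j in range(len(columns)) if j != first]
    second = max(remaining, key=lambda j: betas[j])
    return first, second, betas


# Hypothetical responses (1-5) from eight respondents to a four-item subscale.
responses = [
    [1, 2, 2, 3, 3, 4, 4, 5],
    [1, 1, 2, 2, 3, 3, 5, 5],
    [2, 1, 3, 2, 4, 3, 5, 4],
    [3, 3, 2, 3, 3, 4, 3, 4],
]
first, second, betas = select_two_items(responses)
```

Under the sum-score assumption, each standardized beta equals the item's standard deviation divided by the standard deviation of the total, so the second step favors the remaining item with the largest spread. With transformed or weighted parent scores the betas would differ.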
IRT offers detailed information at the item level, making it a powerful technique for developing short-form scales.[15] The responses of participants to each item were used to estimate their location on a latent trait (i.e., the level of symptoms). This latent trait is estimated from the pattern of responses to all individual items and not, as in CTT, from the sum score of all items. Within IRT, every item is defined by a discrimination parameter (alpha) and one or more location (threshold) parameters. The threshold parameters indicate the location on the latent continuum where the item best discriminates among individuals. The discrimination parameter reflects how strongly the item distinguishes between individuals with different levels of the latent trait (theta) and is comparable to a factor loading.
To determine the candidate items for the short form of the scale, we selected the two items with the highest discrimination parameters in each subscale, as given by the IRT analyses. As the UFS-QoL has five ordered response options (1 = “not at all,” 2 = “A little bit,” 3 = “Somewhat,” 4 = “A great deal,” 5 = “A very great deal”), we used the graded response model to estimate the item parameters. In addition, we inspected the item information curves to select items that covered a similar range of the latent trait as the full scale. Finally, for items on which the CTT and IRT selections were inconsistent, we compared effect sizes to choose the final items.
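How the discrimination and threshold parameters of the graded response model translate into response probabilities and item information can be shown with a short sketch. It assumes the standard logistic form of the graded response model; the parameter values are hypothetical and are not estimates from this study, and the function names are illustrative.

```python
import math


def grm_cumulative(theta, a, thresholds):
    """Boundary probabilities P(X >= k), padded with 1 and 0 at the ends.

    `a` is the discrimination parameter; `thresholds` must be in ascending
    order (four thresholds for five response options).
    """
    inner = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def grm_category_probs(theta, a, thresholds):
    """Probability of each ordered response category at trait level theta."""
    cum = grm_cumulative(theta, a, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def grm_item_information(theta, a, thresholds):
    """Item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = grm_cumulative(theta, a, thresholds)
    slopes = [a * c * (1.0 - c) for c in cum]  # derivative of each boundary
    probs = grm_category_probs(theta, a, thresholds)
    return sum(
        (slopes[k] - slopes[k + 1]) ** 2 / probs[k] for k in range(len(probs))
    )


# Hypothetical item: discrimination 1.8 and four ascending thresholds.
a_demo, b_demo = 1.8, [-1.5, -0.5, 0.6, 1.7]
probs_at_zero = grm_category_probs(0.0, a_demo, b_demo)
# Information evaluated on a grid from -3 to 3 traces the item information curve.
info_curve = [grm_item_information(t / 2.0, a_demo, b_demo) for t in range(-6, 7)]
```

Items with larger discrimination parameters produce taller information curves, and the thresholds determine where along the latent trait that information is concentrated, which is why both were inspected when choosing items.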
Statistical analysis
Spearman's correlation coefficient was used to determine the strength of the correlation between the UFS-QoL-11 and the parent UFS-QoL. To examine the construct validity of the UFS-QoL-11, we employed principal axis factor analysis with orthogonal rotation and determined the final number of factors based on their eigenvalues, congruence, and clinical significance. Each dataset was first assessed for suitability for factor analysis using the Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy and the Bartlett test of sphericity. A KMO value > 0.5 was considered acceptable. Cronbach's α, which typically ranges from 0 to 1, was used to assess the internal consistency of the UFS-QoL-11, that is, the degree to which the items within a subscale measure the same concept. A larger value indicates smaller measurement error and therefore higher reliability. Criterion validity was assessed using correlations between baseline scores on the UFS-QoL-11 and the SF-36, which is designed to measure general health.
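As an illustration of the internal consistency calculation, the following minimal sketch computes Cronbach's α from item-level responses using the standard formula based on item variances and the variance of the total score. The data are hypothetical and the function name is illustrative; the study's analyses were run in SAS.

```python
from statistics import variance


def cronbach_alpha(columns):
    """Cronbach's alpha for one subscale.

    `columns` holds one list of responses per item, with respondents in the
    same order in every list. Sample variances (n - 1 denominator) are used.
    """
    k = len(columns)
    if k < 2:
        raise ValueError("at least two items are required")
    totals = [sum(vals) for vals in zip(*columns)]
    item_variance_sum = sum(variance(col) for col in columns)
    return k / (k - 1) * (1.0 - item_variance_sum / variance(totals))


# Hypothetical two-item subscale answered by four respondents.
alpha_demo = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]])
```

Because α depends on the number of items, two-item subscales tend to yield lower values than longer subscales with the same inter-item correlation, which is worth keeping in mind when comparing the short form with its parent scale.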
We examined known-group validity by testing whether the UFS-QoL-11 could distinguish between clinically distinct groups of patients based on health status (defined as SF-36-1). The ability to detect change was evaluated by comparing the pre-treatment scores with the 6-month and 12-month post-treatment scores. The effect size (mean change score divided by the baseline standard deviation)[25] and the standardized response mean (mean change score divided by the standard deviation of the change scores) were computed. Values of 0.2, 0.5, and ≥ 0.8 were considered “small,” “moderate,” and “large” effects, respectively.
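The two responsiveness indices defined above can be written out as a short sketch. It assumes paired scores from the same patients at baseline and follow-up and uses sample standard deviations; the scores are hypothetical and the function names are illustrative.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    return mean(change) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    return mean(change) / stdev(change)


# Hypothetical HRQL scores for four patients before and after treatment.
pre = [10, 20, 30, 40]
post = [20, 25, 45, 50]
es_demo = effect_size(pre, post)
srm_demo = standardized_response_mean(pre, post)
```

The two indices can differ considerably for the same data: the effect size scales the change by between-patient variability at baseline, whereas the standardized response mean scales it by how consistently patients changed.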
Descriptive analyses (means and standard deviations [SD]) were performed for sociodemographic and clinical characteristics. Mean differences and their 95% confidence intervals (CI) were computed, and statistical significance (P < 0.05) was tested using independent-samples t-tests. The questionnaires were scored according to the developers' instructions. SAS version 9.1.3 was used to conduct the analyses. All statistical tests were prespecified, two-tailed, and conducted with a fixed type I error probability of 0.05; no missing data were imputed.[22]