Illustrative Example
To illustrate power calculations for response shift detection with SEM, we follow [16] and use the SF-36 health-related quality of life questionnaire as an example ([17]; see Figure 1). The eight subscales of the SF-36, measured at two occasions, are modelled as indicators of two underlying latent factors: general physical health and general mental health.
Appendices I-III include the lavaan syntax [18] of all H0 and H1 models used for the chi-square based power calculations of the SEM approach to response shift described below, together with details on the model specification and the model parameter values.
Step 1: Chi-square based power to detect misspecification of the measurement model
The first step of the SEM approach for the detection of response shift entails the specification of the measurement model. This model specifies the measurement structure of the data, where the scores on the observed variables (e.g. scores on questionnaire items or, in this case, scores on the subscales of the SF-36) are related to one or more underlying latent variables (e.g. general mental and physical health) (see Figure 1). A correctly specified measurement model is important because it serves as the basis of comparison for all subsequent models. When the measurement model is not correctly specified (e.g. the number of underlying factors is wrong), subsequent results with regard to the detection of response shift effects will likely be affected [4]. Therefore, it is important to calculate the statistical power to detect possible misspecification of the measurement model.
The model fit of the measurement model is usually evaluated with the chi-square test of exact fit. The null hypothesis (H0) is that the model fits the data exactly; the alternative hypothesis (H1) is that it does not. When the p-value falls below the significance criterion (α), we reject H0 in favor of H1. Incorrectly rejecting H0 is called a Type I error; its probability (α) is usually set at .05. A Type II error (β) is made when H0 should have been rejected but is incorrectly retained. The power of a statistical test is the probability of correctly rejecting H0 (1-β; see Table 1).
Table 1. Statistical power for the three tests in steps 1-3 of the SEM approach to detect response shift

Statistical test: Reject H0
  Reality H0 = true: α (Type I error)
    Step 1: Incorrectly reject measurement model
    Step 2: Incorrectly reject no response shift model
    Step 3: Incorrectly reject no response shift parameter
  Reality H1 = true: 1-β (Power)
    Step 1: Correctly reject measurement model
    Step 2: Correctly reject no response shift model
    Step 3: Correctly reject no response shift parameter

Statistical test: Not reject H0
  Reality H0 = true: 1-α (Correct inference)
    Step 1: Correctly retain measurement model
    Step 2: Correctly retain no response shift model
    Step 3: Correctly retain no response shift parameter
  Reality H1 = true: β (Type II error)
    Step 1: Fail to reject misspecified measurement model
    Step 2: Fail to reject no response shift model
    Step 3: Fail to reject no response shift parameter

Notes: H0 = null hypothesis, H1 = alternative hypothesis
Power calculations require the specification of H0 and H1. With a simple statistical test like a Student's t-test, H0 is usually that the effect is zero (e.g. there is no difference between groups) and H1 is usually set at an effect-size value that is deemed plausible or minimally relevant (e.g. a mean difference corresponding to rules of thumb for small, medium or large effects). Power calculations for the chi-square test of exact fit are based on the difference in chi-square distributions between H0 and H1, and therefore require the specification of both the H0 and the H1 model [19]. Following Oort [16], the H0 model for the SF-36 could be the measurement model as specified in Figure 1. This model works well as an illustration, because it has simple structure (i.e. each variable loads on only one underlying latent factor) and is therefore relatively easy to specify and interpret. The H1 model can be any alternative measurement model of the SF-36. Determining a plausible H1 model is complicated because model misspecification generally does not entail a specific effect of interest within the model. Many different options for the definition of H1 thus exist, e.g. a one-factor model, a three-factor model, or a model with one or multiple cross-loadings. Moreover, the calculation of an effect size for H1 requires that the values of all model parameters in the H1 model are specified. It may thus take quite some deliberation to decide what exactly the misspecification should entail. One approach is to first specify the model under H0, i.e. the model that the researcher deems plausible, including plausible values for all model parameters. Subsequently, one could think of a variation of the H0 model that includes one or more additional parameters for which, if these parameters are not zero, the H0 model should be rejected.
For example, with regard to the measurement model of the SF-36 from our illustrative example, one could think of an alternative measurement model that includes additional loadings (i.e. cross-loadings) of the indicators GH, VT and/or SF, as previously described in the SF-36 manual [17]. With regard to the values of these additional parameters, the recommendation would be to choose the minimum value that would be of interest. In general, specifying the values of model parameters in standardized metric is convenient because they can be interpreted according to general rules of thumb for small, medium, and large effects. For example, standardized factor loadings of .1, .3 and .5 can be interpreted as correlation coefficients and thus represent small, medium and large effects, respectively [7]. In addition, previous findings can be used to inform plausible model parameter values.
Specification of H0 and H1 for the Step 1 chi-square based power calculation. Using the illustrative example, the H0 model of the SF-36 is defined as depicted in Figure 1 (see also Appendix I, page 1). It is based on the 8 subscales of the SF-36 at baseline and follow-up; the number of unique elements in the variance-covariance matrix of the empirical data is thus 16*17/2 = 136. The H0 model contains the specification of 16 factor loadings, 4 underlying latent factor variances, 6 underlying latent factor covariances, 16 residual factor variances, and 8 residual factor covariances. Identification of the model requires that either the underlying latent factor variance or one factor loading for each latent factor is restricted to a fixed value [16], so that the total number of free parameters in the H0 model is 46 (see also Appendix I, page 3). Note that for reasons of conciseness the mean structure is not considered here.
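The degrees-of-freedom bookkeeping above can be verified with a few lines of arithmetic. The following Python sketch simply recomputes the counts given in the text:

```python
# Degrees of freedom of the H0 measurement model in the SF-36 example.
p = 16  # observed variables: 8 subscales measured at 2 occasions

# Unique elements in the variance-covariance matrix: p(p+1)/2.
n_statistics = p * (p + 1) // 2

# Specified parameters: 16 loadings + 4 factor variances + 6 factor
# covariances + 16 residual variances + 8 residual covariances,
# minus 4 parameters fixed for identification (one per latent factor).
n_free = 16 + 4 + 6 + 16 + 8 - 4

df = n_statistics - n_free
print(n_statistics, n_free, df)  # 136 46 90
```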
The H1 model is defined as the H0 model with the addition of two medium-sized cross-loadings of the GH and VT subscales (see Figure 2). Note that there are multiple options for defining H1. This specific H1 was considered a plausible alternative model based on previous research that has found substantial cross-loadings in the measurement model of the SF-36 (e.g. [20-21]). We specified the parameter values in standardized form, where the values of the factor loadings are chosen to be .5 (i.e. of large size [7]) and the values of the variances of the residual factors are chosen such that the total variance of each observed variable is 1. Similarly, the variances of the underlying latent factors are standardized. This entails that the values of the associations between the residual factors and between the underlying factors can also be interpreted as correlation coefficients. The additional cross-loadings in H1 were specified to be of medium size (i.e., .3 [7]). As choosing values for all parameters in the H1 model is arguably the most difficult part of chi-square based power calculations, we return to this issue in the discussion section.
Step 1 chi-square based power calculation with power4SEM. When both models are specified, and plausible values for all model parameters of H1 are provided, we can use power4SEM to calculate the probability of correctly rejecting H0. For reasons of conciseness, we only describe the steps needed to arrive at the desired result. We do not go into the (technical) details of the underlying calculations or required input values, for which the reader is referred to the tutorial paper on power4SEM [13] and/or the help files available under the question mark buttons on the webpage. In addition, Appendix I includes a more detailed visual description of the required procedure. As a first step, insert the lavaan syntax of the H1 and H0 models in the dedicated areas on the “lavaan input” page. A graphical display of both models appears at the right-hand side of the screen (see Figure 3). Use the default setting of N = 200 in the “Intended sample size” box; when the researcher has information on the intended or acquired sample size for the proposed/performed study, one could insert that specific number instead. Click on the green button “Obtain NCP” at the top of the page.
Second, go to the “Chi-square test” page and insert the following values in the box “Input” on the upper left side of the screen: the noncentrality parameter (NCP) value obtained in the first step (i.e. 13.796), the degrees of freedom (Df) of the measurement model (i.e., the number of free statistics minus the number of free model parameters, which in our illustrative example is 136 - 46 = 90), and the alpha value (α = .05). Click on the blue button “Calculate!”. The result of the power calculation is now shown both numerically and graphically at the right-hand side of the screen (see Figure 4). That is, the statistical power to correctly reject our H0 model as specified in Figure 1, when in reality the true model includes two medium-sized cross-loadings, is .261. In other words, there is a 26.1% chance of correctly rejecting H0. This is a rather disappointing result, considering that one generally wants to achieve a power of 80%.
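power4SEM carries out this calculation internally. As an illustration of what happens under the hood, the same power value can be reproduced from the NCP, Df, and α reported above, using the central chi-square distribution for the critical value under H0 and the noncentral chi-square distribution for the rejection probability under H1. This Python/SciPy sketch is not part of the power4SEM toolchain, but follows the same logic:

```python
from scipy.stats import chi2, ncx2

ncp, df, alpha = 13.796, 90, 0.05  # values from the Step 1 example

# Critical chi-square value under H0 (central chi-square distribution).
critical_value = chi2.ppf(1 - alpha, df)

# Power: probability of exceeding the critical value under H1
# (noncentral chi-square with noncentrality parameter NCP).
power = ncx2.sf(critical_value, df, ncp)
print(round(power, 3))  # about 0.261
```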
Sample size needed to acquire sufficient power. An additional feature of power4SEM is that it can also be used to calculate the minimum sample size to achieve a desired power of 80%. If we fill in the required values in the box at the bottom left of the “Chi-square test” page, we find that for our illustrative example the minimum sample size needed is 560. In other words, to increase our confidence that the chi-square test of exact fit will reject our model in Figure 1 when it is misspecified (as defined by two medium-sized cross loadings), we should fit the model to data from at least 560 participants.
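Because the NCP is (approximately) proportional to the sample size, the minimum N for 80% power can also be found by a simple search. In this sketch we assume NCP = N x F0, with F0 derived from the NCP at N = 200; power4SEM may use a slightly different convention (e.g. N - 1), so the result is approximate:

```python
from scipy.stats import chi2, ncx2

df, alpha = 90, 0.05
f0 = 13.796 / 200  # misfit per observation, assuming NCP scales with N
critical_value = chi2.ppf(1 - alpha, df)

def power_at(n):
    # Power of the exact-fit test at sample size n.
    return ncx2.sf(critical_value, df, n * f0)

n = 100
while power_at(n) < 0.80:
    n += 1
print(n)  # close to the 560 reported by power4SEM
```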
The illustrated chi-square based power calculations can thus be valuable in two situations. First, they can be helpful for studies in which the sample size is already determined or for studies that have already been completed, as they can provide confidence in the accurateness of the specified measurement model. In addition, and preferably, they can be helpful for sample-size planning at the stage of study design. A general drawback of chi-square based power calculations, however, is that they require explicitly specified models with values for all model parameters. As an alternative, power calculations for overall model fit evaluation can also be based on the root mean square error of approximation (RMSEA) fit index.
Alternative power calculation for overall model fit evaluation: RMSEA-based power
Instead of relying on chi-square based power calculations to detect possible misfit in the measurement model, one can also use RMSEA-based power [22]. The RMSEA is an alternative measure of overall model fit, where values < .05 are indicative of ‘close fit’, < .08 of ‘acceptable fit’, and > .10 of ‘poor fit’ [23]. Because the RMSEA value is derived from the chi-square value, we can also derive the chi-square distributions under H0 and H1 from RMSEA values. That is, in order to calculate statistical power for overall model fit evaluation, we only need to specify the RMSEA values of H0 and H1, instead of having to specify all model parameters in both models. So, for example, one can investigate the power to reject close fit (RMSEA value H0 = .05) when in the population there is not close fit (RMSEA value H1 = .08). This power calculation is similar to the chi-square based power calculation in that it provides the power to correctly reject a misspecified measurement model. Another advantage of the RMSEA-based power calculation is that we can also switch the direction of hypothesis testing, so that we can calculate the power to reject H1 when H0 is true. This is an advantage because with SEM we usually assume that H0 is true. That is, we believe that the model that we specify under H0 is the true model, and so we are not directly interested in the power to reject H0 when in fact H1 is true; instead, it would be more informative to know the power to reject H1 when H0 is true. So, for example, we can investigate the power to reject a model with not-close fit (RMSEA value H0 = .08) in favor of a model with close fit (RMSEA value H1 = .05), when there is ‘true’ close fit of the model. More stringently, following MacCallum et al. [22], one could calculate the power to reject a model with ‘not close fit’ using RMSEA H0 = .05 and RMSEA H1 = .01.
This gives us the probability of correctly rejecting a model with RMSEA > .05 if the population RMSEA is .01. Different values may be chosen for H0 and H1, which will of course impact the calculated power. As a general recommendation, one could use the cut-off values on which one bases the decision of whether or not the model fits the data well.
Step 1 RMSEA-based power calculation with power4SEM. RMSEA-based power calculations are also available in the power4SEM app, under the “RMSEA” page. Here, we need to provide the RMSEA values for H0 and H1. Suppose we calculate the power to reject close fit (RMSEA = .05) of the measurement model in Figure 1, when there is ‘true’ not-close fit in the population (RMSEA = .08). We also provide the intended sample size (N = 200), alpha value (.05), and the number of degrees of freedom of the model of interest (df = 90). If we click on the red button “Calculate!”, the result is shown both numerically and graphically at the right-hand side of the screen (see Figure 5). When the model in reality shows not-close fit, the power to reject the hypothesis of close fit is 0.937. If we reverse the RMSEA values, we see that the power to reject the hypothesis of not-close fit (RMSEA H0 = .08) when the model in the population shows close fit (RMSEA H1 = .05) is 0.936.
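The RMSEA-based calculation can likewise be sketched directly from the noncentral chi-square distribution. Following MacCallum et al. [22], the noncentrality parameter implied by an RMSEA value ε is (N - 1) x df x ε²; rejection is in the upper tail when H1 implies worse fit than H0, and in the lower tail otherwise. A Python/SciPy sketch (not part of power4SEM itself):

```python
from scipy.stats import ncx2

def rmsea_power(rmsea0, rmsea1, n, df, alpha=0.05):
    # Noncentrality parameters implied by the H0 and H1 RMSEA values.
    ncp0 = (n - 1) * df * rmsea0 ** 2
    ncp1 = (n - 1) * df * rmsea1 ** 2
    if rmsea1 > rmsea0:
        # Reject H0 for large chi-square values (e.g. rejecting close fit).
        critical_value = ncx2.ppf(1 - alpha, df, ncp0)
        return ncx2.sf(critical_value, df, ncp1)
    # Reject H0 for small chi-square values (e.g. rejecting not-close fit).
    critical_value = ncx2.ppf(alpha, df, ncp0)
    return ncx2.cdf(critical_value, df, ncp1)

print(round(rmsea_power(0.05, 0.08, 200, 90), 3))  # about 0.937
print(round(rmsea_power(0.08, 0.05, 200, 90), 3))  # about 0.936
```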
Step 2: Chi-square based power to detect the overall presence of response shift
The second step in the SEM approach for response shift detection entails an omnibus test of the presence of response shift. The presence of response shift is indicated by a change in the pattern of factor loadings (reconceptualization), the values of factor loadings (reprioritization) or the values of intercepts (recalibration[1]). The omnibus test is performed by comparing the so-called ‘no response shift model’, i.e. a model in which all parameters that are associated with response shift are restricted to be equal across time, to the measurement model (in which all these parameters are free to vary across time). The chi-square values of both models can be compared using a chi-square difference test, where a significant p-value indicates that H0 (no response shift) should be rejected (see also Table 1). In other words, it indicates the overall presence of (any type of) response shift. Statistical power for this chi-square difference test indicates the probability of correctly rejecting H0 (no response shift) when in reality response shift effects are present (see also Table 1). When statistical power is low, there is a high chance that the test will incorrectly indicate that there is no response shift. The difficulty for the power calculation is, as in Step 1, to define H1. Here, H1 refers to a model that includes indications of response shift, and one thus has to determine what the ‘overall presence of response shift’ looks like. That is, one has to determine the exact type, number, and size of the possible response shift effects for which H0 should be rejected.
Specification of H0 and H1 for the Step 2 chi-square based power calculation. The H0 model that is used in power calculations for the omnibus test of response shift is the ‘no response shift model’ in which all factor loadings and intercepts are restricted to be equal across baseline and follow-up (see Figure 6 and Appendix II). The number of degrees of freedom for this model is 102 (see Appendix II for more details). The number of degrees of freedom for the chi-square difference test that is used to test for the overall presence of response shift is thus 102 - 90 = 12. The H1 model is specified the same as the H0 model, but includes some response shift effects. That is, the H1 model is defined by including differences in the pattern of factor loadings, the values of factor loadings, and/or the values of intercepts across time. The choice of the type, number, and size of possible response shift effects to include in H1 is greatly facilitated when a priori hypotheses on the potential occurrence of response shift exist. Based on theory or prior research one may have an idea of what type (i.e. recalibration, reprioritization or reconceptualization), what number, and how large the possible response shift effects may be. For example, previous studies on response shift with the SF-36 indicated the presence of reconceptualization (GH subscale [24]), reprioritization (SF subscale [24], RP subscale [21]) and recalibration response shift (PF subscale [25], RP and BP subscales [24]). When there is no a priori information available, the specification of a plausible H1 is more difficult. As a general recommendation, one could include the minimum number of response shift effects that would be of interest. As the response shift effects refer to targeted parameters, generally accepted rules of thumb can be used to specify small, medium or large effects.
The choice of the H1 model specification in our illustrative example is not based on previous findings of (sizes of) effects, as the lack of context complicates using substantive considerations in our model specification. Therefore, in our illustrative example H1 is specified as a model that includes a total of three response shift effects, i.e. one medium-sized recalibration effect, one medium-sized reprioritization effect, and one medium-sized reconceptualization effect (see Figure 6 and Appendix II). Note that we now also include the mean structure, as the estimation of underlying factor means is now part of the modelling procedure.
Step 2 chi-square based power calculation with power4SEM. When both the H0 and H1 models are specified, and plausible values for all model parameters of H1 are provided, we can use power4SEM to calculate the probability of correctly rejecting the H0 of no response shift (see also Appendix II). First, the lavaan syntax of the H0 and H1 models is inserted into the designated input boxes on the “lavaan input” page (see Figure 7). The result is obtained by clicking on the green button “Obtain NCP”. Second, on the “Chi-square test” page the obtained NCP value (36.688), the Df of the chi-square difference test (12), and the appropriate alpha (.05) are provided as input to obtain the statistical power of the test. The result is shown on the right side of the page (see Figure 8), where the power to correctly reject the H0 of no response shift is .994. Thus, when three medium-sized response shifts exist in reality, the omnibus test for response shift is very likely to correctly reject the hypothesis of no response shift.
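The power of the omnibus chi-square difference test can be reproduced in the same way as for the Step 1 exact-fit test, now using the difference-test degrees of freedom. A Python/SciPy sketch (outside the power4SEM toolchain) with the values from the text:

```python
from scipy.stats import chi2, ncx2

ncp, df_diff, alpha = 36.688, 12, 0.05  # values from the Step 2 example

# Critical value of the chi-square difference test under H0,
# then the rejection probability under H1.
critical_value = chi2.ppf(1 - alpha, df_diff)
power = ncx2.sf(critical_value, df_diff, ncp)
print(round(power, 3))  # about 0.994
```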
Step 3: Chi-square based power to detect specific response shift effects
The third step in the SEM approach for response shift detection includes specific tests for response shift effects. That is, the tenability of the equality restrictions on model parameters associated with response shift is investigated one by one. Again, the chi-square difference test can be used to test the tenability of each equality restriction. The H0 of no response shift now refers to one specific response shift effect (see Table 1). When the p-value falls below the alpha criterion, the H0 of no response shift for that specific parameter is rejected. Sufficient statistical power is needed to ensure that, when the specific response shift effect that is being evaluated exists, there is a high chance that the chi-square difference test will detect it. If statistical power is low, there is a high chance that response shift effects are missed.
Specification of H0 and H1 for the Step 3 chi-square based power calculation. The H0 model that is used in power calculations for tests on specific response shift effects is, again, the ‘no response shift model’ in which all factor loadings and intercepts are restricted to be equal across baseline and follow-up (see Figure 9 and Appendix III). The difference with the power calculations for the omnibus test of response shift is that the H1 model includes only one specific response shift effect. The number of degrees of freedom for the chi-square difference test that is used to test for the presence of a single response shift is 1 (instead of 12 for the omnibus test of response shift). Using the illustrative example, we specify three different H1 models for the detection of one medium-sized recalibration, reprioritization or reconceptualization response shift, respectively (see Figure 9). There are thus three different power calculations associated with the chi-square test for specific response shift. Here, we elaborate on the power to detect a specific indication of reconceptualization response shift (see Appendix III for the syntax of all three power calculations), which is defined as a medium-sized cross-loading of VT at the follow-up measurement (H1 model A in Figure 9).
Step 3 chi-square based power calculation with power4SEM. We use power4SEM to calculate the chance to correctly reject H0 of no response shift, in favor of H1 with one indication of a medium-sized reconceptualization response shift (see also Appendix III). The NCP value that is derived by inserting the H0 and H1 model syntaxes in the “lavaan input” page is 9.013 (see Figure 10). In combination with Df = 1 and α = .05 this results in a power of .851 (see Figure 11). That is, the chance that the H0 of no reconceptualization response shift of VT will be correctly rejected (when there is a medium-sized effect present in reality) is 85.1%. This is good news, as the calculated power falls above the desired power of 80%.
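The same noncentral chi-square logic reproduces this result, now with a single degree of freedom. A Python/SciPy sketch (outside the power4SEM toolchain) with the values from the text:

```python
from scipy.stats import chi2, ncx2

ncp, df_diff, alpha = 9.013, 1, 0.05  # values from the Step 3 example

# Critical value of the 1-df chi-square difference test under H0,
# then the rejection probability under H1.
critical_value = chi2.ppf(1 - alpha, df_diff)
power = ncx2.sf(critical_value, df_diff, ncp)
print(round(power, 3))  # about 0.851
```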
Note that when the omnibus test of response shift is used in the same situation (i.e., when only one reconceptualization response shift is present in reality), the power to detect such an effect is reduced to 45.4% (see Appendix III for details). That is, the power to detect a single response shift effect will be higher for the chi-square test on a specific parameter (i.e., Step 3 of the SEM approach) than for the omnibus chi-square test (i.e., Step 2 of the SEM approach). However, as there are many specific parameters that can be tested for the presence of response shift, the increasing number of statistical tests performed on the same data will generally lead to an increased Type I error rate (see Table 1). There is thus a balance to be found between the protection against Type I errors offered by the omnibus test and the higher power to detect single indications of response shift offered by the specific tests.
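This trade-off can be made concrete by holding the effect (and thus the NCP) fixed and varying only the degrees of freedom of the test. A sketch, again assuming the noncentral chi-square logic used for the power calculations above:

```python
from scipy.stats import chi2, ncx2

ncp, alpha = 9.013, 0.05  # one medium-sized response shift effect

def power(df_diff):
    # Power of a chi-square difference test with the given df and fixed NCP.
    critical_value = chi2.ppf(1 - alpha, df_diff)
    return ncx2.sf(critical_value, df_diff, ncp)

specific = power(1)   # Step 3: targeted 1-df test
omnibus = power(12)   # Step 2: omnibus 12-df test, same single effect
print(round(specific, 3), round(omnibus, 3))  # about 0.851 vs 0.454
```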