Exploring scenarios in a two-arm clinical trial with a binary outcome using a Shiny-based application
We have implemented a freely available Shiny app for understanding, planning, and simulating the effect of the various design factors that must be evaluated when computing the required sample size in a two-arm clinical trial with a binary outcome [24] (Figure 1).
The menu options of this application are designed to fulfill three objectives. First, in the "What’s your Goal?" option, the user can see the effect of changing the equivalence margin, the sample sizes, and the proportion of the outcome in each arm of the trial (Figure 2). The interface also explains what conclusions can be drawn about the difference between the treatment arms and its clinical significance. Second, in the “Plan a trial” option, the user can plan an experiment by defining a scenario that suits the available knowledge on the specific problem. The app will compute the sample size required for different goals under a requirement of either statistical power or precision in the estimation of the treatment effect (Figures 3 and 4). Third, in the “Simulate trials” option, the user visualizes simulations of different experiments, which facilitates understanding the expected results when performing an actual trial (Figures 5 and 6).
Understanding the different goals: Equivalence, non-inferiority, and superiority
Before discussing the actual computation of sample size, it is important to explore the concept of Equivalence Range and the interpretation of Superiority, Non-Inferiority, and Equivalence. The Equivalence Range is the difference in outcome probabilities (the treatment effect) that we consider non-relevant. For instance, an equivalence range of 0.1 means that a difference in probabilities smaller than 0.1 is considered too low to conclude a practical difference between the treatments compared. This option is illustrated in Figure 2.
Once we fix an equivalence range, we can evaluate the corresponding confidence interval for the difference of probabilities between the two groups (either treatment vs. control, or a new treatment vs. a reference treatment). Basically, depending on where this interval lies relative to zero and to the equivalence range, we conclude superiority, non-inferiority, equivalence, or an inconclusive result.
In the example shown in Figure 2, we consider a trial that results in 41% improvement in the treatment group (n = 50) and 20% in the control group (n = 50). These are sample proportions, which are used to compute the confidence interval for the treatment effect (the difference in population probabilities). In this example, the results suggest superiority but do not meet the threshold of clinical relevance. The user can play with different scenarios by changing the sample sizes and the proportions of improvement in both groups. In particular, it is important to understand that the interpretation depends on the equivalence range considered. It is also instructive to explore the effect of increasing the sample size in each scenario. After playing with this panel, the user is ready to plan an actual trial.
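For readers who want to check these interpretations outside the app, the confidence interval behind this example can be approximated with a simple Wald interval for the difference of two proportions (a minimal sketch in Python; the app may use a different interval method):

```python
from statistics import NormalDist

def diff_ci(p1, n1, p2, n2, conf=0.95):
    """Wald (normal-approximation) confidence interval for the
    difference of two independent proportions, p1 - p2."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Scenario of Figure 2: 41% improvement with n=50 vs. 20% with n=50
lo, hi = diff_ci(0.41, 50, 0.20, 50)
print(f"95% CI for the treatment effect: ({lo:.3f}, {hi:.3f})")

margin = 0.1                                  # equivalence range
print("superiority:", lo > 0)                 # CI entirely above 0
print("clinical superiority:", lo > margin)   # CI entirely above the margin
```

The lower bound is above 0, suggesting superiority, but below the 0.1 equivalence margin, so clinical superiority cannot be claimed, in line with the interpretation above.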
Sample size considerations in a two-arm clinical trial with a binary outcome
Determination of an appropriate sample size requires considering the following issues [23, 25]:
- Determining the minimum difference in probabilities that is of practical relevance for the trial: the required sample size depends on the smallest difference in probabilities one considers of practical significance. In a scenario where the difference in the probabilities of the event between groups is small, it will be difficult, or even impossible, to demonstrate that difference in a trial (see below).
- Understanding the practical meaning of statistical power: statistical power is the probability of detecting the minimum effect size that is considered meaningful in practical applications. Although a value of 0.8 is commonly used, it is important to stress that this means that 20% of trials will fail to detect this effect. Can we increase power at will? What are the implications?
- Importance of the minimum effect size for clinical relevance (difference of probabilities): the computed sample size will ensure that you attain the desired statistical power and can identify the treatment effect whenever that effect is equal to or greater than this minimum.
- Effect of changing the ratio between the number of treated subjects and controls: the optimal ratio depends on the cost of a treatment relative to that of a control.
- Effect of changing the significance level: to reduce the probability of rejecting the null hypothesis when it is true, a lower significance level must be used. Lowering the significance level, however, increases the sample size required to attain the desired power.
- Effect of choosing between a superiority, non-inferiority, or clinical-relevance goal: the required sample size will be different in each case.
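The issues above can be tied together with the standard large-sample formula for the required number of subjects per group when comparing two proportions (shown here as a sketch; the app may apply additional corrections, such as continuity adjustments). For a one-sided test at level α with power 1 − β, anticipated probabilities p₁ and p₂, and margin m:

```latex
n = \frac{\left(z_{1-\alpha} + z_{1-\beta}\right)^{2}\,\left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{\left(p_1 - p_2 - m\right)^{2}}
```

Here m = 0 corresponds to plain superiority, a negative m to non-inferiority, and a positive m (the minimum clinically relevant difference) to clinical superiority; the denominator shows directly why small relevant differences demand large samples.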
In Figure 3 we show a typical calculation of sample size using our app (menu option: “Plan a trial”). In this example, suppose we assume 60% improvement for the control group and 80% for the treatment group. If this is the actual case, the trial planned in Figure 4 will not show that the treatment is better. What are the requirements in that case? Are we going to reach a correct conclusion in a trial? To further analyze this, we consider that the cost of a treatment is 10€, while the cost of a control is 1€. Under these conditions, the app indicates that we need 62 subjects per group to conclude superiority when the difference between the improvement probabilities of treatment and control is 0.8 − 0.6 = 0.2 and the minimum effect size to detect is 0.1. For non-inferiority, a sample size of 28 per group is enough, while we need 248 per group to show clinical superiority (a difference greater than the minimum of 0.1).
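The three sample sizes quoted above can be reproduced with a one-sided normal-approximation formula (a sketch under standard large-sample assumptions; the app's exact implementation may differ in rounding or corrections):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p_t, p_c, margin, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a one-sided
    test of H0: p_t - p_c <= margin at level alpha with the given
    power, assuming true probabilities p_t and p_c."""
    z = NormalDist().inv_cdf
    num = (z(1 - alpha) + z(power)) ** 2 * (p_t*(1-p_t) + p_c*(1-p_c))
    return ceil(num / (p_t - p_c - margin) ** 2)

print(n_per_group(0.8, 0.6, 0.0))    # superiority              -> 62
print(n_per_group(0.8, 0.6, -0.1))   # non-inferiority (m=0.1)  -> 28
print(n_per_group(0.8, 0.6, 0.1))    # clinical superiority     -> 248
```

Plain superiority uses a zero margin, non-inferiority a margin of −0.1, and clinical superiority a margin of +0.1, reproducing the 62, 28, and 248 subjects per group quoted above.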
At this point, we have different costs for the trial: 682€ for a superiority trial, 308€ for non-inferiority, and 2728€ for showing clinical superiority. Given that the cost of a treatment is ten times the cost of a control, we may ask whether we could use fewer treated subjects and compensate by increasing the number of controls. In Figure 4 we show the optimization of the ratio between the number of treated subjects and the number of controls. Under the considered conditions, the optimum ratio is around 0.27, reducing the cost of the different possible trials. For instance, concluding clinical superiority with a statistical power of 0.8 would require 141 treatments and 502 controls. The app also computes the sample size necessary to obtain a precise estimate of the treatment's effectiveness. You can see that in this scenario precision requires a much larger sample size: 1563 controls and 438 treatments.
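As a sanity check on the unbalanced design, the following sketch verifies that the 141/502 allocation reported by the app attains the target power for clinical superiority while costing much less than the balanced design (assuming a one-sided normal-approximation power formula; the app's optimizer may differ slightly, which would explain the reported ratio of 0.27 vs. 141/502 ≈ 0.28):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
ALPHA, POWER = 0.05, 0.8
P_T, P_C = 0.8, 0.6
V_T, V_C = P_T * (1 - P_T), P_C * (1 - P_C)
COST_T, COST_C = 10, 1        # euros per treated subject / control
MARGIN = 0.1                  # clinical-superiority margin

def power_ok(n_t, n_c):
    """True if the one-sided power to show a difference above
    MARGIN reaches the target POWER."""
    se = (V_T / n_t + V_C / n_c) ** 0.5
    return NormalDist().cdf((P_T - P_C - MARGIN) / se - z(1 - ALPHA)) >= POWER

# Balanced design: 248 per group; unbalanced design from the app: 141/502
cost_equal = 248 * (COST_T + COST_C)
cost_unbal = 141 * COST_T + 502 * COST_C
print(power_ok(141, 502), cost_unbal, cost_equal)
```

The unbalanced design keeps the power at 0.8 while cutting the cost from 2728€ to 1912€.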
Once the appropriate sample size based on statistical power has been determined, one may ask about the expected outcomes of the trial. Simulation enables us to assess the potential results of a trial conducted under the planned conditions. Using this app, we can simulate many trials and compare their typical results. In Figure 5 we show simulations under the conditions required to conclude superiority, with a computed sample size of 35 treatments and 126 controls (power of 0.8, minimum effect of 0.1 for the considered scenario). The simulated results illustrate several interesting issues:
- After evaluating 2000 trials, 79.05% conclude superiority (that is, a statistical power of about 80%).
- Each trial produces a relatively wide confidence interval. Thus, a given trial cannot provide a precise estimate of the treatment effect. In that situation, we cannot conclude clinical superiority (only 36.75% of the trials show clinical superiority, hence a poor statistical power for this goal). We should increase the sample size to 141 treatments and 502 controls to achieve an 80% power for clinical superiority.
- 97.05% of the trials show non-inferiority (we are using a much larger sample size than the minimum required for an 80% statistical power in that case).
- It is important to stress that these are the results of many trials. In practice, we would obtain only one of them. This makes it especially relevant to choose an appropriate sample size after considering all the issues involved, thereby minimizing the risk of obtaining misleading results.
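The simulation panel can be mimicked with a short Monte Carlo experiment (a stripped-down sketch assuming a Wald one-sided lower confidence bound; the app's exact procedure may differ):

```python
import random
from statistics import NormalDist

random.seed(1)
Z = NormalDist().inv_cdf(0.95)   # one-sided 95% bound
N_T, N_C = 35, 126               # sample sizes from the superiority design
P_T, P_C = 0.8, 0.6              # assumed true improvement probabilities

def one_trial():
    """Simulate one trial and return the lower one-sided 95%
    confidence bound for the difference of proportions."""
    x_t = sum(random.random() < P_T for _ in range(N_T))
    x_c = sum(random.random() < P_C for _ in range(N_C))
    p_t, p_c = x_t / N_T, x_c / N_C
    se = (p_t*(1-p_t)/N_T + p_c*(1-p_c)/N_C) ** 0.5
    return (p_t - p_c) - Z * se

lower_bounds = [one_trial() for _ in range(2000)]
superiority = sum(lb > 0 for lb in lower_bounds) / 2000
clinical = sum(lb > 0.1 for lb in lower_bounds) / 2000
print(f"superiority: {superiority:.3f}, clinical superiority: {clinical:.3f}")
```

With a fixed seed the estimated fractions land near the 79% (superiority) and 37% (clinical superiority) reported above; other seeds give slightly different values, which is precisely the sampling variability the panel illustrates.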
The user can test the different conditions and decide which are the appropriate settings for a specific situation.
In the previous results, it is important to emphasize that statistical power does not assure an appropriate precision in the estimation of the effect. In that example, the attained precision may be insufficient for practical conclusions. To obtain the desired precision, we need to consider 1563 controls and 438 treatments. The difference is large, and the results are shown in Figure 6.
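The contrast between the power-based and precision-based designs can be quantified by the planned half-width of the confidence interval for the treatment effect (a normal-approximation sketch, evaluated at the assumed probabilities of 0.8 and 0.6):

```python
from statistics import NormalDist

def half_width(p_t, n_t, p_c, n_c, conf=0.95):
    """Planned half-width of the two-sided CI for the difference
    of two proportions (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return z * (p_t*(1-p_t)/n_t + p_c*(1-p_c)/n_c) ** 0.5

hw_power = half_width(0.8, 35, 0.6, 126)       # power-based design
hw_precision = half_width(0.8, 438, 0.6, 1563) # precision-based design
print(f"power-based design (35/126):       +/-{hw_power:.3f}")
print(f"precision-based design (438/1563): +/-{hw_precision:.3f}")
```

The power-based design (35/126) yields intervals of roughly ±0.16, while the precision-based design (438/1563) narrows them to about ±0.045.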
Exploring scenarios in a multi-arm clinical trial with a normally distributed response
We have developed a Shiny app for simulating a multi-arm experiment and exploring several issues that should be considered when computing an appropriate sample size [26]. In this app, we include a brief introduction to the linear model and show the interpretation of the different sums of squares and the ANOVA table. The most important menu option is “Simulate a multi-arm trial”. This option has different panes (Figure 7).
Let us plan an experiment with a control group (Arm 1) and two treatment groups (Arms 2 and 3). Assume that the control group has a mean biomarker concentration of 100 mg/ml, with a standard deviation of 3 mg/ml. We aim to compute the sample size that reaches a statistical power of 0.8 at a significance level of 0.05. We will start with a sample size of 6 subjects per arm. As a demonstrative example, we require that the minimum effect in Arm 2 is an increase of 3 mg/ml in the mean, and of 5 mg/ml in Arm 3. The corresponding settings in the app are indicated in Figure 7. The population distributions are shown in the “Population” panel, and the data of the simulated trial can be checked in the “Simulated data” and “Descriptive” panels. The “Check assumptions” panel includes tests for equality of variances and for normality, two basic conditions for ANOVA.
The “ANOVA results” option shows the ANOVA table and the estimation of the differences between groups using the Tukey method (Figure 8).
In that example, in the ANOVA table, we obtain a p-value of 0.0058 for the Arm effect. Therefore, we can conclude that our analysis suggests an effect due to the treatment. The pairwise confidence intervals for the differences in means among the arms (estimated effects) indicate that we should conclude a significant effect exists when comparing Arm 3 with Arm 1 (control), but no significant effect is observed in the other comparisons.
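The computation behind the ANOVA table can be sketched outside the app as follows (a minimal simulation assuming the scenario above: means of 100, 103, and 105 mg/ml, SD of 3, and 6 subjects per arm; the app additionally reports p-values and Tukey intervals, which require the F and studentized-range distributions):

```python
import random

random.seed(42)
MEANS = {"Arm 1": 100, "Arm 2": 103, "Arm 3": 105}  # control + effects 3 and 5
SD, N = 3, 6

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g)/len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g)/len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

data = [[random.gauss(mu, SD) for _ in range(N)] for mu in MEANS.values()]
print(f"F = {f_statistic(data):.2f}")
```

The simulated F can be compared informally against the tabulated 5% critical value of the F distribution with (2, 15) degrees of freedom, approximately 3.68; values above it lead to concluding an Arm effect.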
Let's compute the statistical power corresponding to the parameters defined and the required sample size to achieve a statistical power of 0.8. Additionally, we also compute the required sample size to achieve a precision of ±1 when comparing the effects of the different arms (Option: “Sample size, power and precision”, Figure 9).
With the initial sample size of 6, we had a statistical power of 0.648, which means that 64.8% of the samples obtained under these parameter settings will identify the existence of an effect. However, are we able to identify the true effect size? In this case, the effects of the treatments with respect to the control are 3 and 5.
In the right panel, a computation of sample size for a power of 0.8 indicates that we need 8 subjects per group. After changing the sample size in the parameters’ settings, the ANOVA indicates a significant effect of Arm, but the pairwise estimation of the effects still fails to identify the true effects (Figure 10).
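The reported powers can be checked by Monte Carlo (a sketch; the tabulated 5% critical values of F with (2, 15) and (2, 21) degrees of freedom, about 3.68 and 3.47, are hard-coded assumptions here):

```python
import random

random.seed(7)
EFFECTS = (0, 3, 5)               # control, Arm 2, Arm 3
SD = 3
F_CRIT = {6: 3.68, 8: 3.47}       # approx. F_crit(2, 3n-3) at alpha = 0.05

def f_statistic(groups):
    """One-way ANOVA F statistic."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_b = sum(len(g) * (sum(g)/len(g) - grand) ** 2 for g in groups)
    ss_w = sum((x - sum(g)/len(g)) ** 2 for g in groups for x in g)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))

def mc_power(n, sims=4000):
    """Fraction of simulated trials whose F exceeds the critical value."""
    hits = 0
    for _ in range(sims):
        data = [[random.gauss(100 + e, SD) for _ in range(n)] for e in EFFECTS]
        hits += f_statistic(data) > F_CRIT[n]
    return hits / sims

p6 = mc_power(6)
p8 = mc_power(8)
print(f"power at n=6: {p6:.3f}")  # app reports 0.648
print(f"power at n=8: {p8:.3f}")  # around 0.8
```

The estimates agree with the app's values up to Monte Carlo noise, confirming that n = 6 gives roughly 0.65 power and n = 8 roughly 0.8.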
With the new sample size, comparison 3-1 has a confidence interval between 0.64 and 7.95. This is too wide for practical conclusions. To compute an appropriate sample size, let’s require a precision of ±1. We obtain a required sample size of 92 (Figure 9). If we change this in the parameters’ settings, we can see that we are estimating the effects with the required precision and recovering the true values (within the corresponding confidence intervals).
As we have greatly increased the sample size, the statistical power is now 1, which means that we will always conclude the existence of effects in the ANOVA. The difference is that now we will also obtain an appropriately precise estimate of the effects (Figure 11).
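As a rough cross-check of these precision figures, consider the half-width of a single pairwise confidence interval under a normal approximation (a sketch only; the app's Tukey-adjusted intervals are somewhat wider than this single-comparison bound):

```python
from statistics import NormalDist

SD = 3
z = NormalDist().inv_cdf(0.975)   # two-sided 95%

def pairwise_half_width(n, sd=SD):
    """Approximate half-width of a 95% CI for the difference between
    two arm means, with n subjects per arm and known SD."""
    return z * sd * (2 / n) ** 0.5

print(f"n = 8:  +/-{pairwise_half_width(8):.2f}")
print(f"n = 92: +/-{pairwise_half_width(92):.2f}")
```

With n = 8 the approximate half-width is about ±2.9, far above the ±1 target, while with n = 92 it drops to about ±0.87, consistent with the precision-based sample size the app reports.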
In a practical situation, this app allows exploring different scenarios when planning a parallel multi-arm clinical trial with a normally distributed response. These scenarios, similar to the previous case of a two-arm design, enable investigating the potential outcomes if a trial is conducted under such conditions. Here, we discuss some important issues that need to be considered when defining those scenarios:
- It is critical to have a rough idea of the standard deviation of your data. Using the app, you can explore the effect of different values and select the scenario that you consider most relevant to your problem. For instance, you can estimate the value of the standard deviation in a pilot sample of subjects. Remember that high values of standard deviation will imply large sample sizes.
- Statistical power is important. It addresses a different aspect of study design compared to the precision of effect estimation. The sample size required to achieve a given statistical power might not be sufficient for achieving precise estimation of treatment effects. Therefore, while statistical power ensures the likelihood of detecting an effect if it exists, achieving precision in estimating the size of that effect requires careful consideration of sample size and other factors. Both aspects are important in designing rigorous and informative clinical trials.
- Exploring potential effects and defining the minimum effect to detect is important. This involves setting the minimum effect size that you consider meaningful for your study. The computed sample size will ensure reliable results for detecting those minimum effects or larger. While you may not know the true effects beforehand, it's crucial to establish a practical threshold that you find important in the context of your research.
- Consider a mean value for the control group. While its specific value may not be critical for analysis, it can aid in interpretation. In practice, with a given standard deviation and a vector of effects, the results will remain the same regardless of the exact mean value of the control group.
- Fix the precision for the confidence interval in pairwise comparisons between arms. This is crucial when computing the required sample size. As a rule of thumb, the precision should be tighter than the minimum effect you aim to detect.