Exploring scenarios in a two-arm clinical trial with a binary outcome using a Shiny-based application
We have implemented a freely available Shiny app for understanding, planning, and simulating the effect of the various design factors that must be evaluated when computing the required sample size in a two-arm clinical trial with a binary outcome [24] (Figure 1).
The menu options of this application are designed to fulfill three objectives. First, in the "What’s your Goal?" option, the user can see the effect of changing the equivalence margin, the sample sizes, and the proportion of the outcome in each arm of the trial (Figure 2). The interface also explains what conclusions can be drawn about the difference between the treatment arms and its clinical significance. Second, in the “Plan a trial” option, the user can plan an experiment by defining a scenario that suits the available knowledge on the specific problem. The app will compute the sample size required for different goals under a requirement of either statistical power or precision in the estimation of the treatment effect (Figures 3 and 4). Third, in the “Simulate trials” option, the user visualizes simulations of different experiments, which facilitates understanding the expected results when performing an actual trial (Figures 5 and 6).
Understanding the different goals: Equivalence, non-inferiority, and superiority
Before discussing the actual computation of sample size, it is important to explore the concept of Equivalence Range and the interpretation of Superiority, Non-Inferiority, and Equivalence. The Equivalence Range is the difference in outcome probabilities (the treatment effect) that we consider non-relevant. For instance, an equivalence range of 0.1 means that a difference in probabilities smaller than 0.1 is considered too low to conclude a practical difference between the treatments compared. This option is illustrated in Figure 2.
Once we fix an equivalence range, we can evaluate the corresponding confidence interval for the difference of probabilities between the two groups (either treatment vs. control, or a new treatment vs. a reference treatment). Basically, depending on where this interval lies relative to zero and to the equivalence range, we conclude superiority, non-inferiority, equivalence, or an inconclusive result.
In the example shown in Figure 2, we consider a trial that results in 41% improvement in the treatment group (n = 50) and 20% in the control group (n = 50). These are sample proportions, which are used to compute the confidence interval for the treatment effect (the difference in population probabilities). In this example, the results suggest superiority but do not meet the threshold of clinical relevance. The user can play with different scenarios by changing the sample sizes and the proportions of improvement in both groups. In particular, it is important to understand that the interpretation depends on the equivalence range considered. It is also instructive to explore the effect of increasing the sample size in each scenario. After playing with this panel, the user is ready to plan an actual trial.
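For readers who want to check these interpretations outside the app, the confidence interval behind this example can be approximated with a simple Wald interval for the difference of two proportions (a minimal sketch in Python; the app may use a different interval method):

```python
from statistics import NormalDist

def diff_ci(p1, n1, p2, n2, conf=0.95):
    """Wald (normal-approximation) confidence interval for the
    difference of two independent proportions, p1 - p2."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Scenario of Figure 2: 41% improvement with n=50 vs. 20% with n=50
lo, hi = diff_ci(0.41, 50, 0.20, 50)
print(f"95% CI for the treatment effect: ({lo:.3f}, {hi:.3f})")

margin = 0.1                                  # equivalence range
print("superiority:", lo > 0)                 # CI entirely above 0
print("clinical superiority:", lo > margin)   # CI entirely above the margin
```

The lower bound is above 0, suggesting superiority, but below the 0.1 equivalence margin, so clinical superiority cannot be claimed, in line with the interpretation above.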
Sample size considerations in a two-arm clinical trial with a binary outcome
Determination of an appropriate sample size requires considering the following issues [23, 25]:
- Determining the minimum difference in probabilities that is of practical relevance for the trial: the required sample size depends on the smallest difference in probabilities one considers of practical significance. In a scenario where the difference in the probabilities of the event between groups is small, it will be difficult, or even impossible, to demonstrate that difference in a trial (see below).
- Understanding the practical meaning of statistical power: statistical power is the probability of detecting the minimum effect size that is considered meaningful in practical applications. Although a value of 0.8 is commonly used, it is important to stress that this means that 20% of trials will fail to detect this effect. Can we increase power at will? What are the implications?
- Importance of the minimum effect size for clinical relevance (difference of probabilities): the computed sample size will ensure that you attain the desired statistical power and can identify the treatment effect whenever that effect is equal to or greater than this minimum.
- Effect of changing the ratio between the number of treated subjects and controls: the optimal ratio depends on the cost of a treatment relative to that of a control.
- Effect of changing the significance level: to reduce the probability of rejecting the null hypothesis when it is true, a lower significance level must be used. Lowering the significance level, however, increases the sample size required to attain the desired power.
- Effect of choosing between a superiority, non-inferiority, or clinical-relevance goal: the required sample size will be different in each case.
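The issues above can be tied together with the standard large-sample formula for the required number of subjects per group when comparing two proportions (shown here as a sketch; the app may apply additional corrections, such as continuity adjustments). For a one-sided test at level α with power 1 − β, anticipated probabilities p₁ and p₂, and margin m:

```latex
n = \frac{\left(z_{1-\alpha} + z_{1-\beta}\right)^{2}\,\left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{\left(p_1 - p_2 - m\right)^{2}}
```

Here m = 0 corresponds to plain superiority, a negative m to non-inferiority, and a positive m (the minimum clinically relevant difference) to clinical superiority; the denominator shows directly why small relevant differences demand large samples.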
In Figure 3 we show a typical calculation of sample size using our app (menu option: “Plan a trial”). In this example, suppose we assume 60% improvement for the control group and 80% for the treatment group. If this is the actual case, the trial planned in Figure 4 will not show that the treatment is better. What are the requirements in that case? Are we going to reach a correct conclusion in a trial? To further analyze this, we consider that the cost of a treatment is 10€, while the cost of a control is 1€. Under these conditions, the app indicates that we need 62 subjects per group to conclude superiority when the difference between the improvement probabilities of treatment and control is 0.8 − 0.6 = 0.2 and the minimum effect size to detect is 0.1. For non-inferiority, a sample size of 28 per group is enough, while we need 248 per group to show clinical superiority (a difference greater than the minimum of 0.1).
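The three sample sizes quoted above can be reproduced with a one-sided normal-approximation formula (a sketch under standard large-sample assumptions; the app's exact implementation may differ in rounding or corrections):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p_t, p_c, margin, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a one-sided
    test of H0: p_t - p_c <= margin at level alpha with the given
    power, assuming true probabilities p_t and p_c."""
    z = NormalDist().inv_cdf
    num = (z(1 - alpha) + z(power)) ** 2 * (p_t*(1-p_t) + p_c*(1-p_c))
    return ceil(num / (p_t - p_c - margin) ** 2)

print(n_per_group(0.8, 0.6, 0.0))    # superiority              -> 62
print(n_per_group(0.8, 0.6, -0.1))   # non-inferiority (m=0.1)  -> 28
print(n_per_group(0.8, 0.6, 0.1))    # clinical superiority     -> 248
```

Plain superiority uses a zero margin, non-inferiority a margin of −0.1, and clinical superiority a margin of +0.1, reproducing the 62, 28, and 248 subjects per group quoted above.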
At this point, we have different costs for the trial: 682€ for a superiority trial, 308€ for non-inferiority, and 2728€ for showing clinical superiority. Given that the cost of a treatment is ten times the cost of a control, we may ask whether we could use fewer treated subjects and compensate by increasing the number of controls. In Figure 4 we show the optimization of the ratio between the number of treated subjects and the number of controls. Under the considered conditions, the optimum ratio is around 0.27, reducing the cost of the different possible trials. For instance, concluding clinical superiority with a statistical power of 0.8 would require 141 treatments and 502 controls. The app also computes the sample size necessary to obtain a precise estimate of the treatment's effectiveness. You can see that in this scenario precision requires a much larger sample size: 1563 controls and 438 treatments.
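As a sanity check on the unbalanced design, the following sketch verifies that the 141/502 allocation reported by the app attains the target power for clinical superiority while costing much less than the balanced design (assuming a one-sided normal-approximation power formula; the app's optimizer may differ slightly, which would explain the reported ratio of 0.27 vs. 141/502 ≈ 0.28):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
ALPHA, POWER = 0.05, 0.8
P_T, P_C = 0.8, 0.6
V_T, V_C = P_T * (1 - P_T), P_C * (1 - P_C)
COST_T, COST_C = 10, 1        # euros per treated subject / control
MARGIN = 0.1                  # clinical-superiority margin

def power_ok(n_t, n_c):
    """True if the one-sided power to show a difference above
    MARGIN reaches the target POWER."""
    se = (V_T / n_t + V_C / n_c) ** 0.5
    return NormalDist().cdf((P_T - P_C - MARGIN) / se - z(1 - ALPHA)) >= POWER

# Balanced design: 248 per group; unbalanced design from the app: 141/502
cost_equal = 248 * (COST_T + COST_C)
cost_unbal = 141 * COST_T + 502 * COST_C
print(power_ok(141, 502), cost_unbal, cost_equal)
```

The unbalanced design keeps the power at 0.8 while cutting the cost from 2728€ to 1912€.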
Once the appropriate sample size based on statistical power has been determined, one may ask about the expected outcomes of the trial. Simulation enables us to assess the potential results of a trial conducted under the planned conditions. Using this app, we can simulate many trials and compare their typical results. In Figure 5 we show simulations under the conditions required to conclude superiority, with a computed sample size of 35 treatments and 126 controls (power of 0.8, minimum effect of 0.1 for the considered scenario). The simulated results illustrate several interesting issues:
- After evaluating 2000 trials, 79.05% conclude superiority (that is, a statistical power of about 80%).
- Each trial produces a relatively wide confidence interval. Thus, a given trial cannot provide a precise estimate of the treatment effect. In that situation, we cannot conclude clinical superiority (only 36.75% of the trials show clinical superiority, hence a poor statistical power for this goal). We should increase the sample size to 141 treatments and 502 controls to achieve an 80% power for clinical superiority.
- 97.05% of the trials show non-inferiority (we are using a much larger sample size than the minimum required for an 80% statistical power in that case).
- It is important to stress that these are the results of many trials. In practice, we would obtain only one of them. This makes it especially relevant to choose an appropriate sample size after considering all the issues involved, thereby minimizing the risk of obtaining misleading results.
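The simulation panel can be mimicked with a short Monte Carlo experiment (a stripped-down sketch assuming a Wald one-sided lower confidence bound; the app's exact procedure may differ):

```python
import random
from statistics import NormalDist

random.seed(1)
Z = NormalDist().inv_cdf(0.95)   # one-sided 95% bound
N_T, N_C = 35, 126               # sample sizes from the superiority design
P_T, P_C = 0.8, 0.6              # assumed true improvement probabilities

def one_trial():
    """Simulate one trial and return the lower one-sided 95%
    confidence bound for the difference of proportions."""
    x_t = sum(random.random() < P_T for _ in range(N_T))
    x_c = sum(random.random() < P_C for _ in range(N_C))
    p_t, p_c = x_t / N_T, x_c / N_C
    se = (p_t*(1-p_t)/N_T + p_c*(1-p_c)/N_C) ** 0.5
    return (p_t - p_c) - Z * se

lower_bounds = [one_trial() for _ in range(2000)]
superiority = sum(lb > 0 for lb in lower_bounds) / 2000
clinical = sum(lb > 0.1 for lb in lower_bounds) / 2000
print(f"superiority: {superiority:.3f}, clinical superiority: {clinical:.3f}")
```

With a fixed seed the estimated fractions land near the 79% (superiority) and 37% (clinical superiority) reported above; other seeds give slightly different values, which is precisely the sampling variability the panel illustrates.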
The user can test the different conditions and decide which are the appropriate settings for a specific situation.
In the previous results, it is important to emphasize that statistical power does not assure an appropriate precision in the estimation of the effect. In that example, the attained precision may be insufficient for practical conclusions. To obtain the desired precision, we need to consider 1563 controls and 438 treatments. The difference is large, and the results are shown in Figure 6.
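The contrast between the power-based and precision-based designs can be quantified by the planned half-width of the confidence interval for the treatment effect (a normal-approximation sketch, evaluated at the assumed probabilities of 0.8 and 0.6):

```python
from statistics import NormalDist

def half_width(p_t, n_t, p_c, n_c, conf=0.95):
    """Planned half-width of the two-sided CI for the difference
    of two proportions (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return z * (p_t*(1-p_t)/n_t + p_c*(1-p_c)/n_c) ** 0.5

hw_power = half_width(0.8, 35, 0.6, 126)       # power-based design
hw_precision = half_width(0.8, 438, 0.6, 1563) # precision-based design
print(f"power-based design (35/126):       +/-{hw_power:.3f}")
print(f"precision-based design (438/1563): +/-{hw_precision:.3f}")
```

The power-based design (35/126) yields intervals of roughly ±0.16, while the precision-based design (438/1563) narrows them to about ±0.045.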
Exploring scenarios in a multi-arm clinical trial with a normally distributed response
We have developed a Shiny app for simulating a multi-arm experiment and exploring several issues that should be considered when computing an appropriate sample size [26]. In this app, we include a brief introduction to the linear model and show the interpretation of the different sums of squares and the ANOVA table. The most important menu option is “Simulate a multi-arm trial”. This option has different panes (Figure 7).
Let us plan an experiment with a control group (Arm 1) and two treatment groups (Arms 2 and 3). Assume that the control group has a mean biomarker concentration of 100 mg/ml, with a standard deviation of 3 mg/ml. We aim to compute the sample size that reaches a statistical power of 0.8 at a significance level of 0.05. We will start with a sample size of 6 subjects per arm. As a demonstrative example, we require that the minimum effect in Arm 2 is an increase of 3 mg/ml in the mean, and of 5 mg/ml in Arm 3. The corresponding settings in the app are indicated in Figure 7. The population distributions are shown in the “Population” panel, and the data of the simulated trial can be checked in the “Simulated data” and “Descriptive” panels. The “Check assumptions” panel includes tests for equality of variances and for normality, two basic conditions for ANOVA.
The “ANOVA results” option shows the ANOVA table and the estimation of the differences between groups using the Tukey method (Figure 8).
In that example, in the ANOVA table, we obtain a p-value of 0.0058 for the Arm effect. Therefore, we can conclude that our analysis suggests an effect due to the treatment. The pairwise confidence intervals for the differences in means among the arms (estimated effects) indicate that we should conclude a significant effect exists when comparing Arm 3 with Arm 1 (control), but no significant effect is observed in the other comparisons.
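The computation behind the ANOVA table can be sketched outside the app as follows (a minimal simulation assuming the scenario above: means of 100, 103, and 105 mg/ml, SD of 3, and 6 subjects per arm; the app additionally reports p-values and Tukey intervals, which require the F and studentized-range distributions):

```python
import random

random.seed(42)
MEANS = {"Arm 1": 100, "Arm 2": 103, "Arm 3": 105}  # control + effects 3 and 5
SD, N = 3, 6

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g)/len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g)/len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

data = [[random.gauss(mu, SD) for _ in range(N)] for mu in MEANS.values()]
print(f"F = {f_statistic(data):.2f}")
```

The simulated F can be compared informally against the tabulated 5% critical value of the F distribution with (2, 15) degrees of freedom, approximately 3.68; values above it lead to concluding an Arm effect.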
Let's compute the statistical power corresponding to the parameters defined and the required sample size to achieve a statistical power of 0.8. Additionally, we also compute the required sample size to achieve a precision of ±1 when comparing the effects of the different arms (Option: “Sample size, power and precision”, Figure 9).
With the initial sample size of 6, we had a statistical power of 0.648, which means that 64.8% of the samples obtained under these parameter settings will identify the existence of an effect. However, are we able to identify the true effect size? In this case, the effects of the treatments with respect to the control are 3 and 5.
In the right panel, a computation of sample size for a power of 0.8 indicates that we need 8 subjects per group. After changing the sample size in the parameters’ settings, the ANOVA indicates a significant effect of Arm, but the pairwise estimation of the effects still fails to identify the true effects (Figure 10).
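The reported powers can be checked by Monte Carlo (a sketch; the tabulated 5% critical values of F with (2, 15) and (2, 21) degrees of freedom, about 3.68 and 3.47, are hard-coded assumptions here):

```python
import random

random.seed(7)
EFFECTS = (0, 3, 5)               # control, Arm 2, Arm 3
SD = 3
F_CRIT = {6: 3.68, 8: 3.47}       # approx. F_crit(2, 3n-3) at alpha = 0.05

def f_statistic(groups):
    """One-way ANOVA F statistic."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_b = sum(len(g) * (sum(g)/len(g) - grand) ** 2 for g in groups)
    ss_w = sum((x - sum(g)/len(g)) ** 2 for g in groups for x in g)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))

def mc_power(n, sims=4000):
    """Fraction of simulated trials whose F exceeds the critical value."""
    hits = 0
    for _ in range(sims):
        data = [[random.gauss(100 + e, SD) for _ in range(n)] for e in EFFECTS]
        hits += f_statistic(data) > F_CRIT[n]
    return hits / sims

p6 = mc_power(6)
p8 = mc_power(8)
print(f"power at n=6: {p6:.3f}")  # app reports 0.648
print(f"power at n=8: {p8:.3f}")  # around 0.8
```

The estimates agree with the app's values up to Monte Carlo noise, confirming that n = 6 gives roughly 0.65 power and n = 8 roughly 0.8.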
With the new sample size, comparison 3-1 has a confidence interval between 0.64 and 7.95. This is too wide for practical conclusions. To compute an appropriate sample size, let’s require a precision of ±1. We obtain a required sample size of 92 (Figure 9). If we change this in the parameters’ settings, we can see that we are estimating the effects with the required precision and recovering the true values (within the corresponding confidence intervals).
As we have greatly increased the sample size, the statistical power is now 1, which means that we will always conclude the existence of effects in the ANOVA. The difference is that now we will also obtain an appropriately precise estimate of the effects (Figure 11).
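As a rough cross-check of these precision figures, consider the half-width of a single pairwise confidence interval under a normal approximation (a sketch only; the app's Tukey-adjusted intervals are somewhat wider than this single-comparison bound):

```python
from statistics import NormalDist

SD = 3
z = NormalDist().inv_cdf(0.975)   # two-sided 95%

def pairwise_half_width(n, sd=SD):
    """Approximate half-width of a 95% CI for the difference between
    two arm means, with n subjects per arm and known SD."""
    return z * sd * (2 / n) ** 0.5

print(f"n = 8:  +/-{pairwise_half_width(8):.2f}")
print(f"n = 92: +/-{pairwise_half_width(92):.2f}")
```

With n = 8 the approximate half-width is about ±2.9, far above the ±1 target, while with n = 92 it drops to about ±0.87, consistent with the precision-based sample size the app reports.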
In a practical situation, this app allows exploring different scenarios when planning a parallel multi-arm clinical trial with a normally distributed response. These scenarios, similar to the previous case of a two-arm design, enable investigating the potential outcomes if a trial is conducted under such conditions. Here, we discuss some important issues that need to be considered when defining those scenarios:
- It is critical to have a rough idea of the standard deviation of your data. Using the app, you can explore the effect of different values and select the scenario that you consider most relevant to your problem. For instance, you can estimate the value of the standard deviation in a pilot sample of subjects. Remember that high values of standard deviation will imply large sample sizes.
- Statistical power is important. It addresses a different aspect of study design compared to the precision of effect estimation. The sample size required to achieve a given statistical power might not be sufficient for achieving precise estimation of treatment effects. Therefore, while statistical power ensures the likelihood of detecting an effect if it exists, achieving precision in estimating the size of that effect requires careful consideration of sample size and other factors. Both aspects are important in designing rigorous and informative clinical trials.
- Exploring potential effects and defining the minimum effect to detect is important. This involves setting the minimum effect size that you consider meaningful for your study. The computed sample size will ensure reliable results for detecting those minimum effects or larger. While you may not know the true effects beforehand, it's crucial to establish a practical threshold that you find important in the context of your research.
- Consider a mean value for the control group. While its specific value may not be critical for analysis, it can aid in interpretation. In practice, with a given standard deviation and a vector of effects, the results will remain the same regardless of the exact mean value of the control group.
- Fix the precision for the confidence interval in pairwise comparisons between arms. This is crucial when computing the required sample size. As a rule of thumb, the precision should be tighter than the minimum effect you aim to detect.