This section provides information on and examples of a variety of methodological issues investigators may need to address when designing, conducting, and analyzing a study.
These issues are arranged into three categories: the first describes issues common to most clinical trials and many other studies, the second presents issues associated with specific designs, and the third addresses specialized methods.
Methods Applicable to Most Clinical Trials and Many Other Studies
Most clinical trials should include one or more clearly defined primary outcomes. Investigators should specify the measure(s) to be used, including the time frame and the method of aggregating data (e.g., mean value, mean change from baseline, percent with a specified value, or time to event). Investigators should specify the statistical test to be used for the primary analysis and the timing of that analysis in terms of the number of subjects, events, and/or calendar time. Artificially dichotomized endpoints should be avoided, and the use of surrogate endpoints should be justified. In addition, the effect of missing data or censoring due to death or withdrawal on the outcome should be considered.
Sample Size Estimates
Sample size estimates for clinical trials should specify significance level, power, allocation ratio, target difference, adjustments for loss to follow-up or multiple comparisons, and whether differences between groups will be absolute or relative. The target difference and other parameters should be justified based on relevant prior data, and sources should be identified. Sample size estimates for time-to-event outcomes should include the accrual time, duration of follow-up, and expected number of events. Restricting inclusion to higher-risk patients to accrue events faster and with a smaller sample size may be a reasonable consideration, but it should be weighed against the generalizability of the results. Sufficient information should be included to allow reviewers to replicate the sample size estimate, and the reference used should be cited.
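As an illustration of how the listed parameters combine, the following is a minimal sketch of a per-arm sample size estimate for a two-sided comparison of two means with 1:1 allocation, using the standard normal-approximation formula and a simple inflation for loss to follow-up. The function name, the parameter names, and the numerical inputs are hypothetical and chosen only for illustration; a real proposal would use the method and reference appropriate to its design and outcome.

```python
import math
from statistics import NormalDist


def n_per_arm_two_means(delta, sd, alpha=0.05, power=0.90, loss=0.0):
    """Per-arm sample size, two-sided test of two means, 1:1 allocation.

    delta: target (absolute) difference in means
    sd:    assumed common standard deviation
    loss:  anticipated proportion lost to follow-up (0 <= loss < 1)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = z.inv_cdf(power)            # quantile corresponding to the power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    # Inflate so that the expected number of completers still meets the target.
    return math.ceil(n / (1 - loss))


# Hypothetical inputs: target difference 5, SD 10, 90% power, 10% loss.
print(n_per_arm_two_means(delta=5, sd=10, power=0.90, loss=0.10))
```

Reporting the inputs alongside the result, as in the call above, is what allows a reviewer to replicate the estimate.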
Because adjusting for missing outcomes requires assumptions, investigators should strive to minimize the amount of missing outcome data. If the probability that the outcome is missing depends on baseline variables, investigators should condition the analysis on those baseline variables. For a univariate outcome, this conditioning leads to a complete-case analysis. With many baseline variables that predict missingness in the outcome, investigators should consider propensity-to-be-missing scores. To adjust for missing longitudinal outcomes, investigators should consider maximum likelihood or multiple imputation methods that allow the probability that the outcome is missing to depend on previous outcomes and possibly baseline variables. If there is considerable uncertainty about the missing-data mechanism, investigators should consider a sensitivity analysis, which may include a worst-case scenario. Investigators should discuss how they will try to reduce unplanned missing outcomes and how they propose to analyze the data if there are missing outcomes.
- Example: In a randomized trial of two therapies with a survival outcome, investigators computed propensity-to-be-missing scores based on baseline covariates of age and a biomarker, which were thought to affect both survival and dropout. The survival analysis conditioned on the quintile of the propensity-to-be-missing scores.
Investigators who propose a study with multiple treatment arms, more than one outcome, or interim analyses may need to adjust for multiple comparisons. Primary outcomes should be adjusted for multiplicity using the most rigorous methods. Secondary outcomes may be adjusted at somewhat more liberal levels. Adjustment is also appropriate for exploratory outcomes to avoid wasting resources in future studies. Classical Bonferroni p-value adjustment is simple to implement but has lower power than other methods. Investigators should detail their multiplicity adjustment procedure and demonstrate appropriate consideration of multiplicity adjustment in analysis, sample size calculation, and interpretation of results.
- Example: Researchers evaluated the effect of baricitinib, methotrexate, or a combination of baricitinib and methotrexate in patients with rheumatoid arthritis. As there were multiple treatment arms and several outcomes, adjustment for multiple comparisons was particularly important. The researchers used a weighted, sequentially rejective, closed, Bonferroni-based multiple testing procedure to provide strong control of the family-wise error rate.
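To illustrate why classical Bonferroni adjustment has lower power than sequentially rejective alternatives, the sketch below compares Bonferroni-adjusted p-values with those from the unweighted Holm step-down procedure. Holm is used here only as the simplest sequentially rejective Bonferroni-based method; it is not the weighted procedure used in the trial described above, and the p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjusted p-values (unweighted)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


# Hypothetical unadjusted p-values for four comparisons.
p = [0.010, 0.020, 0.030, 0.005]
print("Bonferroni:", bonferroni(p))
print("Holm:      ", holm(p))
```

With these inputs and a family-wise error rate of 0.05, Bonferroni rejects two hypotheses while Holm rejects all four, and every Holm-adjusted p-value is no larger than its Bonferroni counterpart.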
Methods Recommended for Specific Designs
Randomization in Randomized Controlled Trials (RCTs)
A study in which individuals are randomized to study arms and observations on those individuals are analyzed to evaluate the effect of an intervention is called a randomized controlled trial (RCT). Randomization insulates the study from many biases. Even so, there are many types of randomization and not all are equally adept at reducing or eliminating biases. A good randomization method simultaneously controls both chronological bias (by distributing units to study arms evenly over time) and selection bias (by remaining unpredictable). Maximum Tolerated Imbalance (MTI) randomization methods control chronological bias and provide good encryption, thereby reducing selection bias; these methods have been automated and are freely available (https://ctrandomization.cancer.gov/). The very common method of permuted block randomization, by comparison, does a good job controlling chronological bias but provides little encryption and so provides little protection against selection bias.
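The idea behind MTI randomization can be sketched with the big stick design, one well-known MTI procedure: each assignment is a fair coin flip unless the running imbalance between arms has reached the maximum tolerated imbalance, in which case the next participant is assigned to the under-represented arm. This is an illustrative sketch only; the function names and the MTI value are assumptions, and it is not the implementation offered at the site cited above.

```python
import random


def big_stick_sequence(n, mti=3, seed=None):
    """Two-arm allocation sequence under the big stick design."""
    rng = random.Random(seed)
    imbalance = 0  # (number assigned to A) minus (number assigned to B)
    sequence = []
    for _ in range(n):
        if imbalance >= mti:
            arm = "B"                 # forced: A is ahead by the MTI
        elif imbalance <= -mti:
            arm = "A"                 # forced: B is ahead by the MTI
        else:
            arm = rng.choice("AB")    # otherwise a fair coin flip
        imbalance += 1 if arm == "A" else -1
        sequence.append(arm)
    return sequence


def max_abs_imbalance(sequence):
    """Largest absolute imbalance reached at any point in the sequence."""
    imbalance = worst = 0
    for arm in sequence:
        imbalance += 1 if arm == "A" else -1
        worst = max(worst, abs(imbalance))
    return worst


seq = big_stick_sequence(40, mti=3, seed=2024)
print("".join(seq), max_abs_imbalance(seq))
```

Because only the assignments made at the imbalance boundary are deterministic, far fewer assignments can be predicted than at the end of each permuted block, while the imbalance never exceeds the chosen MTI.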
Parallel Group- or Cluster-Randomized Trials (GRTs)
A study in which groups or clusters are randomized to study arms and individual observations are analyzed to evaluate the effect of the intervention is called a parallel group- or cluster-randomized trial (GRT). In these studies, special methods are warranted for analysis and sample size estimation. Investigators should show that their analytic and sample size methods are appropriate given their plans for assignment of participants and delivery of interventions. Additional information is available on the parallel GRT page.
- Example: Dialysis facilities were randomized to one of two phosphate management strategies. All patients seen at a given facility received the same intervention based on the facility randomization. The primary outcome was a composite of mortality and hospitalization outcomes.
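One reason special sample size methods are needed is that outcomes within a group tend to be correlated. A common first approximation, sketched below, inflates the sample size from an individually randomized design by the design effect 1 + (m - 1) x ICC, where m is the number of participants per group and ICC is the intraclass correlation. The inputs are hypothetical, and the sketch assumes equal group sizes and ignores degrees-of-freedom corrections, so it is a starting point rather than a complete method.

```python
import math


def design_effect(m, icc):
    """Variance inflation for equal-sized groups of m with intraclass correlation icc."""
    return 1 + (m - 1) * icc


def groups_per_arm(n_individual, m, icc):
    """Groups needed per arm, given the per-arm n under individual randomization."""
    inflated_n = n_individual * design_effect(m, icc)
    return math.ceil(inflated_n / m)


# Hypothetical inputs: 63 per arm if individually randomized,
# 20 patients per facility, ICC of 0.05.
print(design_effect(20, 0.05))       # roughly 1.95
print(groups_per_arm(63, 20, 0.05))  # facilities needed per arm
```

Even a small ICC nearly doubles the required sample size in this example, which is why the ICC assumed in a proposal should be justified from relevant prior data.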
Individually Randomized Group-Treatment (IRGT) Trials
A study in which participants are randomized individually to study arms but receive at least some of their intervention in a real or virtual group or through a shared facilitator is called an individually randomized group-treatment (IRGT) trial. Additional information is available on the IRGT page.
- Example: Patients with chronic low back pain were randomized individually to intervention or control. Within the intervention arm, patients were nested within acupuncturists, each of whom treated multiple patients.
Stepped Wedge Group- or Cluster-Randomized Trials (SWGRTs)
A study in which groups of participants are randomly assigned to sequences and the intervention is delivered on a staggered schedule until all sequences receive the intervention is called a stepped wedge group- or cluster-randomized trial (SWGRT or SWCRT). In these studies, special methods are warranted for analysis and sample size estimation. Investigators should discuss any assumption that temporal trends are the same across the staggered groups and present justifications for the use of stepped wedge designs over the parallel GRT or CRT. Logistical or political considerations are examples of possible justifications, but investigators should tailor justifications to their specific situation. Accounting for variance inflation due to group randomization in stepped wedge designs is complex due to the multilevel and longitudinal nature of the data, and depends on whether the trial is cross-sectional, closed cohort, or open cohort. Decaying between-period correlations further complicate this process and should also be considered. Investigators should consider whether intervention effects will change as a function of exposure time, as not accounting for such changes can severely bias estimates of intervention effects and standard errors. Investigators should address these issues in their analytic plan and in their sample size calculations. Additional information is available on the SWGRT page.
- Example: A Managed Aquifer Recharge (MAR) system can reduce salt content in groundwater. The MAR Trial investigated the effect of a MAR system on blood pressure in 16 coastal communities over a period of five months. Four new communities received the MAR system each month, with all 16 communities receiving the system by the end of the trial.
Effectiveness-Implementation Hybrid Studies
Hybrid effectiveness-implementation studies are clinical trials designed to assess the impact of an intervention on clinical effectiveness and implementation. Curran and colleagues proposed three hybrid types (; ): Hybrid Type 1 studies test the effects of a clinical intervention on relevant outcomes while observing and gathering information on implementation, Hybrid Type 2 studies test clinical and implementation strategies simultaneously, and Hybrid Type 3 studies test implementation strategies while observing and gathering information on the intervention’s impact on relevant outcomes. For Hybrid Type 2 studies, dual randomized controlled trials may be employed to conduct rigorous investigations of both implementation strategies and intervention programs ( ). In Hybrid Type 2 and 3 studies, implementation strategies can be developed and evaluated utilizing an implementation research framework; examples include RE-AIM ( ; ) and the Consolidated Framework for Implementation Research ( ; ). More information on using implementation frameworks is available ( ). Because Hybrid Type 2 and 3 studies emphasize implementation outcomes, randomization at a group or health-care facility level may be proposed, often using a parallel or stepped-wedge group- or cluster-randomized trial (see above).
- Hybrid Type 1 Study. The effectiveness and safety of a physical activity intervention with demonstrated efficacy for breast cancer survivors was evaluated in a community-based setting to allow simultaneous assessment of barriers to implementation.
- Hybrid Type 2 Study. An intervention was developed to address the co-occurrence of diabetes and depression and was pilot tested with good results. A Hybrid Type 2 study examined the effectiveness of the intervention and tested an implementation strategy for increasing intervention use and fidelity.
- Hybrid Type 3 Study. A randomized trial was conducted to test the impact of a bundle of implementation strategies on adoption and sustainability of an intervention to improve physical function in community-dwelling disabled and older adults.
Multiphase Optimization STrategy (MOST)
MOST is an approach for the optimization and evaluation of an intervention. It consists of three phases of research. The preparation phase includes the identification of the potential components of the intervention, called candidate components; the development of a conceptual model that identifies the mechanism(s) by which the components are expected to affect the outcome; and the establishment of feasibility. The optimization phase includes testing the components individually and in combination using carefully conducted and adequately powered optimization trials, and then, based on the experimental results, selecting the components and component levels that make up the optimized intervention. The optimized intervention may then be evaluated in a randomized control trial (RCT). Interventions can be fixed (applied to all participants in the same way) or adaptive (varied over participants and/or time). An overview is available () as are more complete treatments ( ; ).
- Example: For youth with migraine, MOST was employed to examine the effects in reducing headache days of a mind and body intervention package. The preparation phase was completed using qualitative research to identify a set of candidate intervention components. In the optimization phase the candidate components were tested in a randomized optimization trial. The results of the optimization trial formed the basis for selection of the components that made up the optimized intervention. The optimized intervention package will be evaluated for its effectiveness in a future RCT.
Sequential, Multiple Assignment, Randomized Trial (SMART)
SMART designs can help investigators optimize adaptive interventions (also known as stepped care, dynamic treatment regimens, and treatment policies). An adaptive intervention is a sequence of pre-specified decision rules that guide how dynamic information about the individual should be used in practice to decide whether and how to intervene at specific points in time during treatment. The SMART is an experimental design that involves multiple stages of randomization, meaning that some or all study participants can be randomized more than once during the trial. Each stage of randomization corresponds to a point in time at which there are scientific questions about whether, how, and under what conditions to intervene in order to build an optimized adaptive intervention. Guidance for SMART designs is available.
- Example: In a SMART to empirically inform the development of an adaptive intervention for patients with chronic sickle cell pain, patients will first be randomized to guided relaxation or acupuncture. After six weeks, patients showing early signs of non-response (based on pre-specified criteria) will be re-randomized to continue their initial treatment assignment or switch to the other treatment. Those showing early signs of response (based on pre-specified criteria) will continue with assessment only. The primary outcome will be pain at 12 weeks.
Investigators who wish to evaluate interventions for chronic or stable conditions may propose a study utilizing a cross-over design. Cross-over designs are statistically efficient and allow for within-subject comparisons. However, carry-over effects, where the effect of a treatment lingers past treatment withdrawal, present difficulties in analysis and may require lengthy washout periods to address. In addition, period effects induced by changes in the outcome over time may introduce further challenges. Since cross-over designs depend on participants receiving two or more treatments at different time periods, participant dropout is of special concern. Investigators should justify their use of a cross-over design and show consideration of carry-over effects, period effects, and missing data in their analytic plan. Cross-over trials utilizing group randomization must also consider the impact of intraclass correlation in design and analysis.
- Example: To investigate the effect of duloxetine on painful chemotherapy-induced peripheral neuropathy, researchers randomly assigned participants to receive either duloxetine followed by placebo or placebo followed by duloxetine. A two-week washout period was implemented after each treatment period to limit carry-over effects.
Noninferiority trials are designed to determine whether a new intervention is not appreciably worse than an active control or standard treatment. Investigators who propose a noninferiority trial should justify the choice of this design over a standard superiority trial. Commonly cited reasons include reduced cost, increased convenience, or fewer side effects. Noninferiority is established if the upper bound of the 95% confidence interval for the difference between the new and standard intervention (scaled so that larger values favor the standard intervention) is less than the noninferiority margin. The choice of the noninferiority margin should be justified. Sample size estimates and statistical methods should discuss the chosen alpha level, the handling of losses to follow-up and nonadherence, and analysis of both intent-to-treat and per-protocol cohorts.
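The decision rule can be sketched for a binary unfavorable outcome, such as treatment failure, using a risk difference and a simple Wald confidence interval. The event counts, the margin, and the choice of the Wald interval are assumptions made for illustration; a real analysis would use the interval method and margin justified in the protocol.

```python
import math
from statistics import NormalDist


def noninferiority_risk_difference(events_new, n_new, events_std, n_std,
                                   margin, alpha=0.05):
    """Wald CI for the risk difference (new minus standard) for an unfavorable event.

    Noninferiority is concluded when the upper confidence bound is below the margin.
    """
    p_new = events_new / n_new
    p_std = events_std / n_std
    diff = p_new - p_std  # positive values mean the new intervention is worse
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = diff - z * se, diff + z * se
    return diff, (lower, upper), upper < margin


# Hypothetical data: 50/500 failures on new, 45/500 on standard, margin of 0.05.
diff, ci, noninferior = noninferiority_risk_difference(50, 500, 45, 500, margin=0.05)
print(round(diff, 3), tuple(round(b, 3) for b in ci), noninferior)
```

With the same data and a stricter margin of 0.04, the upper bound would exceed the margin and noninferiority would not be established, which shows how strongly the conclusion depends on the justified margin.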
Regression Discontinuity Designs
Investigators who wish to evaluate an intervention in a setting in which it is difficult to implement a randomized trial should consider using the regression discontinuity design (RDD) (; ; ). RDD applies when clinical practice or public health programs use a cutoff point on a continuous variable to assign treatment deterministically (sharp RDD) or probabilistically (fuzzy RDD). Examples of the continuous assignment variable may include a biomarker level, blood pressure, age, calendar time, or a score reflecting need for the treatment. The RDD can provide strong evidence for causal inference near the cutoff ( ), where there is no expectation for responses to differ in the absence of a treatment effect, while other non-randomized designs do not provide the same strength of evidence. Investigators proposing the RDD should provide a rationale for choosing a sharp or fuzzy RDD, address the plausibility of the assumptions ( ), consider using latent class instrumental variables ( ) to compute the complier average causal effect (CACE) with fuzzy RDD, and perform a sensitivity analysis ( ).
- Example: In a prostate cancer screening study, the continuous assignment variable was the level of prostate-specific antigen (PSA). A value of PSA larger than 4 μg/l led to a biopsy recommendation. Because not all persons recommended for biopsy received a biopsy, this is an example of fuzzy RDD. Investigators found that PSA screening did not reduce prostate cancer-specific mortality.
Paired Availability Designs
Investigators who wish to evaluate an intervention with a short-term outcome in a setting in which it is difficult to implement a randomized trial should consider using the paired availability design (PAD) (). With the PAD, investigators increase availability of treatment at multiple sites and estimate the effect of receipt of treatment using the method of latent class instrumental variables ( ; ). To reduce bias, the sites should be geographically or institutionally isolated and there should be no change in staff or protocol over the time period of the study. Investigators proposing a PAD should discuss study duration, plans to keep staff and protocols unchanged over time, choice of sites, and plausibility of assumptions.
- Example: Investigators used a PAD to estimate the effect of receipt of epidural analgesia on the probability of Cesarean-section ( ; ). They analyzed data from army medical centers and geographically isolated hospitals before and after the increased availability of epidural analgesia. The results agreed with those of a meta-analysis of randomized trials. The results differed from those of a high-quality multivariate observational study that likely omitted an important confounder of high pain in labor.
Personalized or N-of-1 Trials
Personalized trials — also referred to as N-of-1 trials, idiographic trials, or single-case experimental designs — are trials in which individuals cross over between intervention, placebo, or usual-care conditions across multiple time periods to assess individual response to treatments (; ). Such trials emphasize treatment responses to interventions by a single individual as opposed to an overall average intervention effect for a group of individuals. Personalized trials are especially useful when applied to individuals in subpopulations who may not be well represented in an overall trial, such as those with rare diseases, children, or the elderly ( ; ; ). Results from personalized trials should report effect sizes and confidence/credible intervals along with study results, provide enough information to facilitate meta-analysis, and report autocorrelations and their impact on effect sizes ( ; ). Rigorous methods for design and analysis are available ( ; ; ).
- Example: Statin-related myalgia (muscle pain) is one of the major side effects of statin medications and is a primary reason why patients discontinue statin therapy. Clinicians theorize that some patients receiving statin medications were actually suffering from a nocebo effect, wherein other existing muscle pains and symptoms were being erroneously attributed to statin medications. This may cause patients to stop taking statins unnecessarily, thereby increasing their risk of cardiovascular disease events. Several researchers decided to test this hypothesis using personalized N-of-1 trials ( ). These investigators conducted 200 personalized N-of-1 trials where participants were randomized to six 2-month intervention periods. All enrolled participants were individuals who had previously stopped statin medications due to muscle pain. This crossover trial design had participants alternate between daily treatment periods of receiving atorvastatin 20 mg and receiving a placebo pill. Participants rated their muscle pain daily using a validated visual analogue scale of pain ranging from 0 to 10. The investigators then compared pain symptoms between statin periods and placebo periods using linear mixed model regressions. Of the initial sample, 151 participants contributed data to this analysis that showed no significant difference in pain symptoms between atorvastatin and placebo periods. This suggests that among the study sample, statins did not significantly increase pain. Two-thirds of the participants in the study restarted their statin medications after seeing the trial results. This study shows the potential of using personalized trial designs to experimentally study important clinical issues at the level of the individual patient.
The familiar model-based tests rely on distributional and other assumptions that may not hold, threatening the validity of the inferences based on those tests. That situation is more likely when the sample size is small, as the asymptotic properties of the models may not apply, but problems can occur in a variety of situations. Exact tests, also called randomization tests, avoid those assumptions and rely instead only on the randomization scheme in the study at hand. Exact tests are applicable to any type of outcome (continuous, dichotomous, count, time to event, etc.) and may be used with virtually any randomized design to evaluate the effect of treatment, whether randomization occurs at the level of the individual or group. Numerous textbook treatments on exact tests are available, as are statistical software packages for their implementation (e.g., StatXact, SAS, STATA, SPSS, SYSTAT).
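A randomization test can be sketched for the simplest case: a two-arm trial with complete randomization and fixed arm sizes, where every possible allocation is enumerated and the observed difference in means is compared with the full randomization distribution. The data are hypothetical, and the enumeration is practical only for small samples; larger studies, or other randomization schemes, would need the matching reference set or a Monte Carlo approximation.

```python
from itertools import combinations


def exact_randomization_test(treated, control):
    """Two-sided exact p-value for a difference in means under complete randomization."""
    pooled = list(treated) + list(control)
    n_total, n_t = len(pooled), len(treated)
    total = sum(pooled)

    def abs_mean_diff(treated_sum):
        return abs(treated_sum / n_t - (total - treated_sum) / (n_total - n_t))

    observed = abs_mean_diff(sum(treated))
    n_allocations = n_extreme = 0
    # Enumerate every way the treated arm could have been chosen.
    for idx in combinations(range(n_total), n_t):
        stat = abs_mean_diff(sum(pooled[i] for i in idx))
        n_allocations += 1
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            n_extreme += 1
    return n_extreme / n_allocations


# Hypothetical outcomes for a trial with four participants per arm.
print(exact_randomization_test([12, 14, 15, 16], [8, 9, 10, 11]))
```

In this example the arms are completely separated, so only the observed allocation and its mirror image are as extreme, giving a p-value of 2/70 without any distributional assumption.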
Adherence to Behavioral Interventions
In clinical trials with behavioral interventions, adherence refers to the extent to which participants comply with the intervention, including session attendance, level of session participation, and completion of any assignments. Both level and direction of adherence need to be defined to provide information on dose-response relationships and the consequences of nonadherence. Level of adherence can be assessed using absolute number of sessions attended, percent adherence, or categories. Direction of adherence, typically specified as under- or over-exposure, can also be indicated by absolute number of sessions attended and percent adherence. In addition to the intention-to-treat analysis where all randomly allocated patients are included in the analysis and analyzed in the groups to which they were assigned, an analytical approach frequently applied to adjust for suboptimal adherence is the per-protocol analysis, where data from individuals who fail to achieve a minimal level of adherence are not included (). Other statistical methods include the marginal structural model with inverse-probability weighting ( ) and the method of latent class instrumental variables for all-or-none compliance ( ). Investigators should also attend to the related possibility of differential measurement error in adherence measures ( ; ).
Noncompliance in Encouragement Designs
An encouragement design randomizes participants to encouragement or no encouragement to receive treatment. A special case is when encouragement is an invitation to receive treatment. Investigators who wish to evaluate an encouragement design have several options. One choice is an intent-to-treat analysis which takes advantage of randomization to avoid bias but only estimates the effect of encouragement and not the effect of the treatment received (). A second choice is a per-protocol analysis which estimates the effect of the treatment received but could yield biased estimates because it does not use the randomization. A third choice is the method of latent class instrumental variables ( ) which estimates the complier-average causal effect (CACE), uses randomization to avoid bias, and requires assumptions that are often satisfied when the treatment starts soon after randomization. Investigators should consider the method of latent class instrumental variables for analyzing an encouragement design.
- Example: Investigators randomized pregnant women who smoked to encouragement to stop smoking or no encouragement. The fraction stopping smoking soon after randomization was 0.20 in the no-encouragement group and 0.43 in the encouragement group. The fraction of children born with low weight was 0.089 in the no-encouragement group and 0.068 in the encouragement group. The CACE estimate, expressed as the reduction in the fraction of children born with low weight attributable to smoking cessation among compliers, was (0.089 - 0.068)/(0.43 - 0.20) = 0.091.
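The arithmetic in this example is the ratio of two intent-to-treat contrasts: the effect of encouragement on the outcome divided by the effect of encouragement on receipt of treatment. A minimal sketch follows, using the fractions from the example; the function and argument names are hypothetical, and the sign convention (encouragement minus no encouragement) is a choice made here, so a negative value corresponds to the reduction reported above.

```python
def cace(outcome_enc, outcome_ctl, receipt_enc, receipt_ctl):
    """Complier-average causal effect as a ratio of intent-to-treat differences.

    All differences are taken as encouragement minus no encouragement.
    """
    itt_outcome = outcome_enc - outcome_ctl   # effect of encouragement on outcome
    itt_receipt = receipt_enc - receipt_ctl   # effect of encouragement on receipt
    if itt_receipt == 0:
        raise ValueError("Encouragement did not change receipt of treatment.")
    return itt_outcome / itt_receipt


# Fractions from the smoking-cessation example above.
estimate = cace(outcome_enc=0.068, outcome_ctl=0.089,
                receipt_enc=0.43, receipt_ctl=0.20)
print(round(estimate, 3))  # negative: fewer low-weight births among compliers
```

The estimate is larger in magnitude than the intent-to-treat difference of 0.021 because only the 23 percentage points of participants whose smoking status was changed by encouragement contribute to the effect.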
Evaluation of Risk Prediction Models
The purpose of risk prediction is to help investigators make treatment decisions. For example, they might recommend tamoxifen for women at high risk of developing invasive breast cancer. Investigators need to carefully consider the metric for evaluating risk prediction. Purely statistical metrics such as the odds ratio, the area under the ROC curve, and the subset relative risk (probability of disease in a high-risk subset divided by prevalence of disease) can lead to different conclusions when applied to the same data set. Also, purely statistical metrics do not account for the anticipated cost and benefits of treatment nor the cost of data collection, which limits their relevance when the goal is to inform decision-making. To fully evaluate risk prediction metrics, investigators should consider decision-analytic metrics such as decision curves (), relative utility curves ( ), and test tradeoffs ( ).
- Example: When comparing two models to predict the risk of invasive breast cancer, one with SNP data and one without SNP data, an investigator computed a test tradeoff of 3100 for the addition of SNP data to the model. In other words, to increase net benefit, it is necessary to collect SNP data from 3100 persons for every correct prediction of invasive breast cancer. If SNP data collection were inexpensive, the test tradeoff of 3100 would likely be acceptable and investigators could recommend including SNPs in the risk prediction model.
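The quantity plotted in a decision curve is net benefit, which credits true positives and penalizes false positives by the odds of the risk threshold at which treatment would be chosen. The sketch below computes net benefit at a single threshold for a hypothetical model and for a treat-everyone strategy; the counts are invented for illustration, and the sketch does not compute relative utility or the test tradeoff discussed above.

```python
def net_benefit(true_pos, false_pos, n, threshold):
    """Net benefit of treating those flagged as high risk at a given risk threshold."""
    weight = threshold / (1 - threshold)  # harm of a false positive relative to a true positive
    return true_pos / n - weight * false_pos / n


# Hypothetical cohort of 1000 people with 100 cases, evaluated at a 10% threshold.
model = net_benefit(true_pos=80, false_pos=200, n=1000, threshold=0.10)
treat_all = net_benefit(true_pos=100, false_pos=900, n=1000, threshold=0.10)
print(round(model, 4), round(treat_all, 4))
```

Treating no one has a net benefit of zero by definition, so a model is useful at a given threshold only if its net benefit exceeds both zero and the treat-everyone value.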
Evaluation of Surrogate Endpoints
Investigators who wish to shorten a randomized trial by using a surrogate endpoint should evaluate the surrogate endpoint and consider the limitations of this approach. The Prentice Criterion () requires a detailed understanding of the biological pathway that is typically lacking. A high correlation between the surrogate and true outcomes on a treatment arm can be misleading ( ). The proportion of the treatment effect explained by the surrogate endpoint usually has a wide confidence interval ( ). For a more convincing analysis, investigators should evaluate surrogate endpoints using a meta-analytic approach involving trials with interventions in a similar class ( ; ). In some treatment trials, the class may include type of therapy (e.g., drugs with the same mechanism of action), clinical setting (e.g., line of therapy for the same disease), and standard of care ( ). Two meta-analytic methods are estimating the surrogate threshold effect ( ) and checking the plausibility of 5 criteria for surrogacy, 2 statistical and 3 involving clinical and biological considerations ( ).
- Example: Using a meta-analytic approach with 5 criteria for surrogacy, an investigator evaluated cancer recurrence at 3 to 6 months as a surrogate endpoint for advanced-stage colorectal cancer in 10 historical trials. The investigator reported acceptable values for the 2 statistical criteria: the sample size multiplier was less than 1.5, and the prediction separation score was greater than 1. The investigator also discussed the plausibility of the 3 non-statistical criteria: similarity of biological mechanisms, similarity of secondary treatments, and a negligible risk of harmful side effects after the surrogate endpoint.