fahb - Optimising progression criteria in multi-site pilot trials assessing recruitment
Introduction
Recruiting participants into clinical trials is hard, and if we don't do it well the whole trial is at risk. Without participants and the information they provide, treatment comparisons will be underpowered and we are unlikely to see a statistically significant result even if we are lucky enough to have found an intervention which really works. Trials which can't answer their primary research question represent a huge drain on valuable resources (money, time, and, crucially, participant engagement) which might be better deployed elsewhere. They can also damage trust in the research process. We would like to identify these situations early in the trial's lifecycle, and be able to shut trials down if they are clearly not going to recruit enough participants in a timely manner.
One common approach to this problem is to run a pilot trial and
examine recruitment there. An external pilot is a small study which
looks a lot like the main trial we hope to run, but in miniature. An
internal pilot constitutes the first phase of the main trial itself.
Either way, the idea is that we can look at the pilot recruitment data
and use it to judge whether the main trial should go ahead (or continue
running). We have written an R package, fahb (feasibility assessment
using hierarchical Bayes), to help trial teams do this well.
Making progress
By far the most common approach to assessing feasibility with a pilot trial is to set up what we call progression criteria. These are a set of thresholds which all need to be met before we can be sufficiently sure that the main trial won't fail to recruit. Different pilot trials will measure different things in their progression criteria, but there are some fairly common elements. When thinking about recruitment, the NIHR recommends we look at three things once the pilot has finished:
- the number of participants recruited;
- the number of sites opened to recruitment; and
- the rate of participant recruitment, per site-year.
The idea is that we calculate these three summaries, compare them against pre-specified thresholds, and move forward only if all three thresholds are met. This seems pretty sensible, but the difficulty comes in choosing the thresholds. We need to be careful: if we make them too lenient they will be surpassed too easily and we'll see lots of infeasible studies progressing; but if we make them too harsh, too many trials will fail at the pilot stage even though they are actually perfectly feasible. We know from Mellor et al.'s work that researchers often don't feel confident choosing their thresholds, worrying that they are fairly arbitrary.
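The decision logic itself is simple; here is a minimal R sketch using entirely made-up summaries and thresholds:
# Hypothetical pilot summaries: participants recruited, sites opened,
# and recruitment rate (participants per site-year)
summaries <- c(n = 42, m = 8, r = 4.1)
# Hypothetical pre-specified thresholds - choosing these well is the hard part
thresholds <- c(n = 40, m = 6, r = 3.5)
# Progress only if every summary meets or exceeds its threshold
all(summaries >= thresholds)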
Positives and negatives
What we really need is a way to take a suggested set of progression criteria thresholds and evaluate them. Then we can compare different suggestions and have a concrete way to choose the best one. One approach is to think about the false positive and false negative rates we obtain when applying a particular set of progression criteria thresholds. We call these the operating characteristics of the pilot.
By false positive rate, we mean the probability of getting a positive progression decision when in fact the trial is not going to recruit well. Conversely, the false negative rate is the probability of not getting a positive decision even though the trial is really feasible. But how do we calculate these measures?
One simple approach is to simulate some trials. Thinking about internal pilots for now, a trial simulation would generate a site opening and participant recruitment process according to an assumed statistical model. For each simulated trial we can note how long it took to hit its recruitment target, along with the three pilot summaries described above. We can use the former to classify the trial as either feasible or infeasible, by comparing the recruitment time to a threshold value which marks the tipping point from feasible into infeasible. And we can compare the pilot summaries against our pre-specified thresholds to see whether we would have progressed or stopped the trial early. This gives us two outcomes: the underlying feasibility of the simulated trial, and the progression decision made after the pilot phase.
We can repeat this many times (thousands, or even hundreds of thousands) and use the simulated results to estimate our operating characteristics. We get the false positive rate by taking all the simulated trials which were really infeasible and working out the proportion of these which would have progressed after the pilot. Similarly, the false negative rate is the proportion of all feasible simulations where we would incorrectly have stopped after the pilot. We can do this for different progression criteria thresholds, looking for a set which gives us both a low false positive rate and a low false negative rate.
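To show the estimation logic (and only that - this is not fahb's model: it assumes all sites open at time zero and share a common recruitment rate, with every number chosen arbitrarily), a bare-bones simulation in base R might look like:
set.seed(42)
n_sims <- 10^4
N <- 320; m <- 20; t <- 0.5      # recruitment target, sites, pilot analysis time
T_max <- 3                       # trials needing longer than this (years) count as infeasible
n_p <- 55                        # candidate progression threshold on pilot recruitment

lambda  <- rgamma(n_sims, shape = 10, rate = 2)          # uncertain per-site rates (mean 5/site-year)
n_pilot <- rpois(n_sims, m * lambda * t)                 # recruited by the pilot analysis time
T_full  <- rgamma(n_sims, shape = N, rate = m * lambda)  # time to recruit all N participants

feasible <- T_full <= T_max
progress <- n_pilot >= n_p
FPR <- mean(progress[!feasible])  # infeasible trials which would progress
FNR <- mean(!progress[feasible])  # feasible trials which would stop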
There’s something fishy about this model
We’ve proposed a particular model of site opening and participant
recruitment, to let us simulate all these hypothetical trials. For site
opening, we assume this proceeds randomly but with a consistent overall
rate - specifically, we model openings using a Poisson process. To
use fahb, the user needs to say something about what they expect the
opening rate of this process to be.
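Since the gaps between events in a Poisson process are independent exponential variables, simulating one realisation of site opening takes two lines of base R (the rate is arbitrary, purely for illustration):
m <- 20; opening_rate <- 30                          # sites opening per year (illustrative)
open_times <- cumsum(rexp(m, rate = opening_rate))   # opening time of each site, in years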
We assume a similar model for the recruitment of participants to sites -
that this will be random, but with a consistent rate. But we know that
the rate of recruitment can vary between different sites, with some
doing really well and others not so much. We capture this behaviour
using a hierarchical model along the same lines as proposed by
Anisimov & Fedorov, who showed that it fits observed trial recruitment
data well. To use fahb the user needs to say what they expect of the
overall recruitment rate, and also how much they expect recruitment
rates to vary between sites.
Deciding on decision rules
To use fahb we first set up a problem, then generate a set of
simulations, before finally looking for good thresholds. To set up the
problem we need the model specification described above (we will use
the package defaults here for illustration), along with our recruitment
target N, number of sites m, and the time t (in years) at which we will
analyse the internal pilot data. For example:
library(fahb)
# Initialise the problem
problem <- fahb_problem(N = 320, m = 20, t = 0.5)
# Run n_sims simulations
problem <- forecast(problem, n_sims = 10^4)
# Look for good PC thresholds
design <- fahb_design(problem)
The results can be summarised by plotting:
plot(design)
[Figure: false negative rate against false positive rate for the PC and Bayes decision rules]
This graph shows us the operating characteristics (the false positive and false negative rates) of a range of different decision rules. Note that there are two methods here: "PC" denotes the standard progression criteria described above, while "Bayes" denotes a different type of decision rule involving a more complex analysis of the pilot data (for more information see the associated manuscript by Wilson et al.). We see that we need to trade off the false positive rate against the false negative rate, and how we do that will depend on our own priorities. We also see that the quality of decision rules using progression criteria is very close to that of rules using the more involved Bayesian analysis.
We can see the exact error rates and the corresponding progression criteria thresholds of all the decision rules illustrated here by printing the design:
design
## Standard progression criteria
##
##     FPR        FNR        n_p         m_p        r_p
## 1   0.0 0.87992181 20.5611795  1.66828093 10.3028119
## 11  0.1 0.45182910  8.9043663  1.60474181  6.9203619
## 21  0.2 0.32909802  7.4001946  0.30066758  5.9852812
## 41  0.4 0.18123429  4.8126219  1.18403923  4.9907542
## 51  0.5 0.12496509  2.4656527 -0.88932701  4.8339481
## 71  0.7 0.05487294  0.1053767 -0.75644908  3.6230448
## 81  0.8 0.03253281 -0.8056646  0.84545060  2.8640030
## 91  0.9 0.01717397 -0.3936578 -0.58875695  1.7408977
## 101 1.0 0.00000000 -0.8125629  0.01064443 -0.8196884
##
## Bayesian approximation
##
##     FPR        FNR      T_p
## 1   0.0 1.00000000 1.314099
## 11  0.1 0.45615750 3.102920
## 21  0.2 0.32016197 3.287019
## 41  0.4 0.16824909 3.584644
## 51  0.5 0.12021782 3.716581
## 71  0.7 0.05696733 3.931363
## 81  0.8 0.03295169 4.044890
## 91  0.9 0.01661547 4.161486
## 101 1.0 0.00000000 4.354789
##
## FPR - False Positive Rate
## FNR - False Negative Rate
##
## n_p, m_p, r_p - Probabilistic thresholds for standard
## progression criteria on the number recruited,
## number of sites opened, and the recruitment rate
## (participants per site per year) respectively
##
## T_p - Bayesian decision rule threshold for the posterior predictive
## expected time until full recruitment
By default this gives us the rules which give us FPRs of
$0, 0.1, \ldots , 0.9, 1$, but you can see a finer-grained version by
using print.fahb_design(design, coarse = FALSE). Once we have decided
what balance between the operating characteristics we are happy with,
we can read off the associated progression criteria thresholds and use
them for our pilot. Note that some of these thresholds can be negative,
in which case they are redundant (they will always be met) and we can
simply ignore them. Again, the results for the alternative Bayesian
approach to making progression decisions are also given here.
Seeing the future
If we are using pre-specified progression criteria then the analysis is
straightforward - we just calculate the three summary measures and
compare them against the thresholds we have chosen. We can also use
fahb to run a Bayesian analysis of the recruitment data and provide a
predictive distribution, conditional on what we have observed in the
pilot, of the time needed to hit our recruitment target. To do this we
need to provide the pilot recruitment data in the form of a vector
n_pilot of recruitment numbers at each open site, and a corresponding
vector t_pilot of how long (in years) these sites have been open.
Because this function runs a Bayesian analysis using Stan, it needs to
first compile the associated Stan code and this can take a bit of time.
Once this has been done, if you want to run the analysis again you can
extract the compiled model and provide it to another call of
fahb_analysis so it can be re-used.
fahb_analysis(n_pilot = c(1, 4, 3),
              t_pilot = c(0.43, 0.2, 0.07),
              problem)
## Compiling the model...
## Standard progression criteria statistics:
##      n_p      m_p      r_p
##  8.00000  3.00000 11.42857
##
## Expected posterior predictive time to recruit:
## exp_pp_T
## 3.014911
##
## Posterior predictive distribution quantiles:
##     0.5%     2.5%      20%      50%      80%    97.5%    99.5%
## 1.934566 2.125245 2.548785 2.946941 3.439751 4.291749 4.917994
##
## Posterior site opening rate hyperparameters (Gamma):
## shape  rate
## 33.00  3.35
We can then use this predictive distribution to decide whether the
trial should continue, either by making a judgement directly or by
applying a pre-specified threshold to the expectation of the predicted
time, as suggested by the fahb_design object.
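For example, assuming the rule is to progress when the expected predictive time falls below the chosen threshold (see the Wilson et al. manuscript for the formal definition), we can take the T_p from the FPR = 0.2 row of the design output above:
exp_pp_T <- 3.014911   # expected posterior predictive recruitment time, from the analysis
T_p      <- 3.287019   # threshold giving a false positive rate of 0.2 in fahb_design
exp_pp_T <= T_p        # TRUE here, suggesting the trial should progress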
Size matters
We can use fahb to design and analyse external pilots, too. The main
difference at the design stage is that we need to specify how many sites
to have in the pilot (m_ext), and the target recruitment (n_ext). We
still set a time t for the pilot analysis, which now acts as a
maximum: we will stop the external pilot when we hit the target
recruitment or time t, whichever comes first.
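Under the simplified common-rate model of the earlier sketch (and with an assumed rate chosen purely for illustration), this stopping rule can be expressed directly:
n_ext <- 60; m_ext <- 3; t <- 1      # external pilot target, sites, maximum time
lambda <- 22                         # assumed per-site recruitment rate (illustrative)
T_target <- rgamma(1, shape = n_ext, rate = m_ext * lambda)  # time to hit the target
pilot_end <- min(T_target, t)        # stop at the target or at time t, whichever comes first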
The operating characteristics of an external pilot are going to be
impacted by its sample size, so we can use fahb to explore this and
help us choose. For example, let's compare two pilots recruiting 60
participants over 1 year, but with one using 3 sites and the other
using 10. We will focus on decision rules of the standard progression
criteria form.
library(ggplot2)

problem <- fahb_problem(N = 320, m = 20,
                        t = 1, n_ext = 60, m_ext = 3)
problem <- forecast(problem)
design_m3 <- fahb_design(problem)

problem <- fahb_problem(N = 320, m = 20,
                        t = 1, n_ext = 60, m_ext = 10)
problem <- forecast(problem)
design_m10 <- fahb_design(problem)

# Put the OCs of the two designs into a data frame for plotting together
df <- rbind(design_m3$Prog_Crit_OCs, design_m10$Prog_Crit_OCs)
df$m <- factor(rep(c(3, 10), each = nrow(design_m3$Prog_Crit_OCs)))

ggplot(df, aes(x = FPR, y = FNR, colour = m)) +
  geom_step() +
  theme_minimal()
[Figure: false negative rate against false positive rate for external pilots with 3 and 10 sites]
This shows what kind of benefit we could expect from increasing the
number of sites in the pilot. For example, if we wanted a false positive
rate of 0.2, increasing from 3 to 10 sites would bring the false
negative rate down by around 0.1. We could continue in this manner to
look at other choices for n_ext, m_ext and t until we find the right
balance.
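One way to organise such a sweep is to wrap the fahb calls used above in a small helper function; note that ocs_for_m_ext below is our own convenience function, not part of the package:
# Return the progression criteria operating characteristics
# for an external pilot with m_ext sites
ocs_for_m_ext <- function(m_ext) {
  problem <- fahb_problem(N = 320, m = 20, t = 1,
                          n_ext = 60, m_ext = m_ext)
  problem <- forecast(problem)
  design <- fahb_design(problem)
  cbind(design$Prog_Crit_OCs, m_ext = m_ext)
}

df <- do.call(rbind, lapply(c(3, 5, 10, 15), ocs_for_m_ext))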
Green-lighting progression criteria
In fahb we have provided one potential solution to the problem of
choosing good progression criteria thresholds, and of choosing other
aspects of pilot trial design such as the timing of an internal pilot or
the sample size of an external one. We have also shown that progression
criteria based on the three summary measures suggested by NIHR can be of
high quality, in the sense that they lead to operating characteristics
almost as good as decision rules based on a full Bayesian analysis of
the pilot data. This all means that if we are happy with the modelling
assumptions encoded in fahb, we can safely use it to determine good
progression criteria which give the desired balance between the risks of
false positives and false negatives, making them less arbitrary and more
defensible.
Further work
To extend this work further, it would be nice to consider other models of the recruitment process. For example, we may anticipate a less variable pattern of site opening than a Poisson process would give; or we may even want to model a fully deterministic process, where we set up a plan for site opening and assume it will be adhered to exactly. Other modelling approaches, like those surveyed by Gkioni et al., could also be considered. A modular approach could be explored, allowing different models for site opening and for participant recruitment to be mixed and matched. If we were to use different models, we would need to check whether the same qualitative results hold - i.e. whether standard progression criteria can still provide good decision rules when compared to Bayesian alternatives.
Before any of that, the priority should probably be helping people
determine the model inputs required to use fahb - something we have
glossed over a little here by just using the function defaults. These
inputs define the prior distributions of the three model parameters,
and they could be determined by using historic data, through expert
elicitation, or some mixture of both. This will likely need some more
methodological research; for example, working out the best way to elicit
expert beliefs around likely site recruitment rates and the variability
between them.
References
- Anisimov, V.V. & Fedorov, V.V. (2007). Modelling, prediction and adaptive adjustment of recruitment in multicentre trials. Statistics in Medicine 26(27), 4958–4975.
- Gkioni, E., Rius, R., Dodd, S. & Gamble, C. (2019). A systematic review describes models for recruitment prediction at the design stage of a clinical trial. Journal of Clinical Epidemiology 115, 141–149.
- Mellor, K., Dutton, S. J., Hopewell, S. & Albury, C. (2022). How are progression decisions made following external randomised pilot trials? A qualitative interview study and framework analysis. Trials 23(123).
- Wilson, D.T., Cowtan, S., Vyner, C. (2026). Optimising progression criteria in multi-site pilot trials assessing recruitment.