Run the same brms model on multiple datasets

Run the same brms model on multiple datasets and then combine the results into one fitted model object. This is useful in particular for multiple missing value imputation, where the same model is fitted on multiple imputed data sets. Models can be run in parallel using the future package.

brm_multiple(
  formula,
  data,
  family = gaussian(),
  prior = NULL,
  data2 = NULL,
  autocor = NULL,
  cov_ranef = NULL,
  sample_prior = c("no", "yes", "only"),
  sparse = NULL,
  knots = NULL,
  stanvars = NULL,
  stan_funs = NULL,
  silent = 1,
  recompile = FALSE,
  combine = TRUE,
  fit = NA,
  algorithm = getOption("brms.algorithm", "sampling"),
  seed = NA,
  file = NULL,
  file_compress = TRUE,
  file_refit = getOption("brms.file_refit", "never"),
  ...
)

Arguments

formula: An object of class formula, brmsformula, or mvbrmsformula (or one that can be coerced to that classes): A symbolic description of the model to be fitted. The details of model specification are explained in brmsformula.
data: A list of data.frames each of which will be used to fit a separate model. Alternatively, a mids object from the mice package.
family: A description of the response distribution and link function to be used in the model. This can be a family function, a call to a family function or a character string naming the family. Every family function has a link argument allowing to specify the link function to be applied on the response variable. If not specified, default links are used. For details of supported families see brmsfamily. By default, a linear gaussian model is applied. In multivariate models, family might also be a list of families.
prior: One or more brmsprior objects created by set_prior or related functions and combined using the c method or the + operator. See also default_prior for more help.
data2: A list of named lists each of which will be used to fit a separate model. Each of the named lists contains objects representing data which cannot be passed via argument data (see brm for examples). The length of the outer list should match the length of the list passed to the data argument.
autocor: (Deprecated) An optional cor_brms object describing the correlation structure within the response variable (i.e., the 'autocorrelation'). See the documentation of cor_brms for a description of the available correlation structures. Defaults to NULL, corresponding to no correlations. In multivariate models, autocor might also be a list of autocorrelation structures. It is now recommend to specify autocorrelation terms directly within formula. See brmsformula for more details.
cov_ranef: (Deprecated) A list of matrices that are proportional to the (within) covariance structure of the group-level effects. The names of the matrices should correspond to columns in data that are used as grouping factors. All levels of the grouping factor should appear as rownames of the corresponding matrix. This argument can be used, among others to model pedigrees and phylogenetic effects. It is now recommended to specify those matrices in the formula interface using the gr and related functions. See vignette("brms_phylogenetics") for more details.
sample_prior: Indicate if draws from priors should be drawn additionally to the posterior draws. Options are "no" (the default), "yes", and "only". Among others, these draws can be used to calculate Bayes factors for point hypotheses via hypothesis. Please note that improper priors are not sampled, including the default improper priors used by brm. See set_prior on how to set (proper) priors. Please also note that prior draws for the overall intercept are not obtained by default for technical reasons. See brmsformula how to obtain prior draws for the intercept. If sample_prior is set to "only", draws are drawn solely from the priors ignoring the likelihood, which allows among others to generate draws from the prior predictive distribution. In this case, all parameters must have proper priors.
sparse: (Deprecated) Logical; indicates whether the population-level design matrices should be treated as sparse (defaults to FALSE). For design matrices with many zeros, this can considerably reduce required memory. Sampling speed is currently not improved or even slightly decreased. It is now recommended to use the sparse argument of brmsformula and related functions.
knots: Optional list containing user specified knot values to be used for basis construction of smoothing terms. See gamm for more details.
stanvars: An optional stanvars object generated by function stanvar to define additional variables for use in Stan's program blocks.
stan_funs: (Deprecated) An optional character string containing self-defined Stan functions, which will be included in the functions block of the generated Stan code. It is now recommended to use the stanvars argument for this purpose instead.
silent: Verbosity level between 0 and 2. If 1 (the default), most of the informational messages of compiler and sampler are suppressed. If 2, even more messages are suppressed. The actual sampling progress is still printed. Set refresh = 0 to turn this off as well. If using backend = "rstan" you can also set open_progress = FALSE to prevent opening additional progress bars.
recompile: Logical, indicating whether the Stan model should be recompiled for every imputed data set. Defaults to FALSE. If NULL, brm_multiple tries to figure out internally, if recompilation is necessary, for example because data-dependent priors have changed. Using the default of no recompilation should be fine in most cases.
combine: Logical; Indicates if the fitted models should be combined into a single fitted model object via combine_models. Defaults to TRUE.
fit: An instance of S3 class brmsfit_multiple derived from a previous fit; defaults to NA. If fit is of class brmsfit_multiple, the compiled model associated with the fitted result is re-used and all arguments modifying the model code or data are ignored. It is not recommended to use this argument directly, but to call the update method, instead.
algorithm: Character string naming the estimation approach to use. Options are "sampling" for MCMC (the default), "meanfield" for variational inference with independent normal distributions, "fullrank" for variational inference with a multivariate normal distribution, "pathfinder" for the pathfinder algorithm, "laplace" for the laplace approximation, or "fixed_param" for sampling from fixed parameter values. Can be set globally for the current R session via the "brms.algorithm" option (see options).
seed: The seed for random number generation to make results reproducible. If NA (the default), Stan will set the seed randomly.
file: Either NULL or a character string. In the latter case, the fitted model object is saved via saveRDS in a file named after the string supplied in file. The .rds extension is added automatically. If the file already exists, brm will load and return the saved model object instead of refitting the model. Unless you specify the file_refit argument as well, the existing files won't be overwritten, you have to manually remove the file in order to refit and save the model under an existing file name. The file name is stored in the brmsfit object for later usage.
file_compress: Logical or a character string, specifying one of the compression algorithms supported by saveRDS. If the file argument is provided, this compression will be used when saving the fitted model object.
file_refit: Modifies when the fit stored via the file argument is re-used. Can be set globally for the current R session via the "brms.file_refit" option (see options). For "never" (default) the fit is always loaded if it exists and fitting is skipped. For "always" the model is always refitted. If set to "on_change", brms will refit the model if model, data or algorithm as passed to Stan differ from what is stored in the file. This also covers changes in priors, sample_prior, stanvars, covariance structure, etc. If you believe there was a false positive, you can use brmsfit_needs_refit to see why refit is deemed necessary. Refit will not be triggered for changes in additional parameters of the fit (e.g., initial values, number of iterations, control arguments, ...). A known limitation is that a refit will be triggered if within-chain parallelization is switched on/off.
...: Further arguments passed to brm.

Value

If combine = TRUE a brmsfit_multiple object, which inherits from class brmsfit and behaves essentially the same. If combine = FALSE a list of brmsfit objects.

Details

The combined model may issue false positive convergence warnings, as the MCMC chains corresponding to different datasets may not necessarily overlap, even if each of the original models did converge. To find out whether each of the original models converged, subset the draws belonging to the individual models and then run convergence diagnostics. See Examples below for details.

Parallelization with multiple CPU cores

brms can make use of multiple CPU cores in parallel to speed up computations in various ways. For efficient use of the available resources it is recommended to only use parallelism to an extend such that the available physical CPUs are not oversubscribed. For example, when you have 8 CPU cores locally available, then you may consider to run 4 chains with 2 threads per chain for best performance if you happen to just run a single model. In case you run a simulation study which requires to run many times a given model, then neither chain nor within-chain parallelization is advisable as the computational resources are already exhausted by the simulation study and any further parallelization beyond the simulation study itself will in fact slow down the overall runtime. Please be aware that for historical reasons the nomenclature of the arguments is possibly confusing. The cores argument refers to running different chains in parallel and the within-chain parallelization will allocate for each chain as many threads as requested. The requested threads therefore increase the use of overall CPUs in a multiplicative way.

For more advanced parallelization (including beyond single model fits), brms also integrates with the future package. Importantly, this enables seamless integration with the mirai parallelization framework through the use of the future.mirai adapter. With mirai local and remote machines can be used in a fully transparent manner to the user. This includes the possibility to use large number of remote machines running in the context of a computer cluster, which are managed with queuing systems. Please refer to the section on distributed computing of mirai::daemons.

Examples

if (FALSE) { # \dontrun{
library(mice)
m <- 5
imp <- mice(nhanes2, m = m)

# fit the model using mice and lm
fit_imp1 <- with(lm(bmi ~ age + hyp + chl), data = imp)
summary(pool(fit_imp1))

# fit the model using brms
fit_imp2 <- brm_multiple(bmi ~ age + hyp + chl, data = imp, chains = 1)
summary(fit_imp2)
plot(fit_imp2, variable = "^b_", regex = TRUE)

# investigate convergence of the original models
library(posterior)
draws <- as_draws_array(fit_imp2)
# every dataset has just one chain here
draws_per_dat <- lapply(1:m, \(i) subset_draws(draws, chain = i))
lapply(draws_per_dat, summarise_draws, default_convergence_measures())

# use the future package for parallelization
library(future)
plan(multisession, workers = 4)
fit_imp3 <- brm_multiple(bmi ~ age + hyp + chl, data = imp, chains = 1)
summary(fit_imp3)
} # }