vignettes/brms_phylogenetics.Rmd
brms_phylogenetics.RmdIn the present vignette, we want to discuss how to specify phylogenetic multilevel models using brms. These models are relevant in evolutionary biology when data of many species are analyzed at the same time. The usual approach would be to model species as a grouping factor in a multilevel model and estimate varying intercepts (and possibly also varying slopes) over species. However, species are not independent as they come from the same phylogenetic tree and we thus have to adjust our model to incorporate this dependency. The examples discussed here are from chapter 11 of the book Modern Phylogenetic Comparative Methods and the application in Evolutionary Biology (de Villemeruil & Nakagawa, 2014). The necessary data can be downloaded from the corresponding website (https://www.mpcm-evolution.com/). Some of these models may take a few minutes to fit.
Assume we have measurements of a phenotype, phen (say
the body size), and a cofactor variable (say the
temperature of the environment). We prepare the data using the following
code.
phylo <- ape::read.nexus("https://paul-buerkner.github.io/data/phylo.nex")
data_simple <- read.table(
"https://paul-buerkner.github.io/data/data_simple.txt",
header = TRUE
)
head(data_simple)The phylo object contains information on the
relationship between species. Using this information, we can construct a
covariance matrix of species (Hadfield & Nakagawa, 2010).
A <- ape::vcv.phylo(phylo)Now we are ready to fit our first phylogenetic multilevel model:
model_simple <- brm(
phen ~ cofactor + (1|gr(phylo, cov = A)),
data = data_simple,
family = gaussian(),
data2 = list(A = A),
prior = c(
prior(normal(0, 10), "b"),
prior(normal(0, 50), "Intercept"),
prior(student_t(3, 0, 20), "sd"),
prior(student_t(3, 0, 20), "sigma")
)
)With the exception of (1|gr(phylo, cov = A)) instead of
(1|phylo) this is a basic multilevel model with a varying
intercept over species (phylo is an indicator of species in
this data set). However, by using cov = A in the
gr function, we make sure that species are correlated as
specified by the covariance matrix A. We pass
A itself via the data2 argument which can be
used for any kinds of data that does not fit into the regular structure
of the data argument. Setting priors is not required for
achieving good convergence for this model, but it improves sampling
speed a bit. After fitting, the results can be investigated in
detail.
summary(model_simple)
plot(model_simple, N = 2, ask = FALSE)
plot(conditional_effects(model_simple), points = TRUE)The so called phylogenetic signal (often symbolize by
)
can be computed with the hypothesis method and is roughly
for this example.
hyp <- "sd_phylo__Intercept^2 / (sd_phylo__Intercept^2 + sigma^2) = 0"
(hyp <- hypothesis(model_simple, hyp, class = NULL))
plot(hyp)Note that the phylogenetic signal is just a synonym of the intra-class correlation (ICC) used in the context phylogenetic analysis.
Often, we have multiple observations per species and this allows to fit more complicated phylogenetic models.
data_repeat <- read.table(
"https://paul-buerkner.github.io/data/data_repeat.txt",
header = TRUE
)
data_repeat$spec_mean_cf <-
with(data_repeat, sapply(split(cofactor, phylo), mean)[phylo])
head(data_repeat)The variable spec_mean_cf just contains the mean of the
cofactor for each species. The code for the repeated measurement
phylogenetic model looks as follows:
model_repeat1 <- brm(
phen ~ spec_mean_cf + (1|gr(phylo, cov = A)) + (1|species),
data = data_repeat,
family = gaussian(),
data2 = list(A = A),
prior = c(
prior(normal(0,10), "b"),
prior(normal(0,50), "Intercept"),
prior(student_t(3,0,20), "sd"),
prior(student_t(3,0,20), "sigma")
),
sample_prior = TRUE, chains = 2, cores = 2,
iter = 4000, warmup = 1000
)The variables phylo and species are
identical as they are both identifiers of the species. However, we model
the phylogenetic covariance only for phylo and thus the
species variable accounts for any specific effect that
would be independent of the phylogenetic relationship between species
(e.g., environmental or niche effects). Again we can obtain model
summaries as well as estimates of the phylogenetic signal.
summary(model_repeat1)
hyp <- paste(
"sd_phylo__Intercept^2 /",
"(sd_phylo__Intercept^2 + sd_species__Intercept^2 + sigma^2) = 0"
)
(hyp <- hypothesis(model_repeat1, hyp, class = NULL))
plot(hyp)So far, we have completely ignored the variability of the cofactor within species. To incorporate this into the model, we define
data_repeat$within_spec_cf <- data_repeat$cofactor - data_repeat$spec_mean_cfand then fit it again using within_spec_cf as an
additional predictor.
model_repeat2 <- update(
model_repeat1, formula = ~ . + within_spec_cf,
newdata = data_repeat, chains = 2, cores = 2,
iter = 4000, warmup = 1000
)The results are almost unchanged, with apparently no relationship
between the phenotype and the within species variance of
cofactor.
summary(model_repeat2)Also, the phylogenetic signal remains more or less the same.
hyp <- paste(
"sd_phylo__Intercept^2 /",
"(sd_phylo__Intercept^2 + sd_species__Intercept^2 + sigma^2) = 0"
)
(hyp <- hypothesis(model_repeat2, hyp, class = NULL))Let’s say we have Fisher’s z-transformed correlation coefficients per species along with corresponding sample sizes (e.g., correlations between male coloration and reproductive success):
data_fisher <- read.table(
"https://paul-buerkner.github.io/data/data_effect.txt",
header = TRUE
)
data_fisher$obs <- 1:nrow(data_fisher)
head(data_fisher)We assume the sampling variance to be known and as
for Fisher’s values, where
is the sample size per species. Incorporating the known sampling
variance into the model is straight forward. One has to keep in mind
though, that brms requires the sampling standard
deviation (square root of the variance) as input instead of the variance
itself. The group-level effect of obs represents the
residual variance, which we have to model explicitly in a meta-analytic
model.
model_fisher <- brm(
Zr | se(sqrt(1 / (N - 3))) ~ 1 + (1|gr(phylo, cov = A)) + (1|obs),
data = data_fisher, family = gaussian(),
data2 = list(A = A),
prior = c(
prior(normal(0, 10), "Intercept"),
prior(student_t(3, 0, 10), "sd")
),
control = list(adapt_delta = 0.95),
chains = 2, cores = 2, iter = 4000, warmup = 1000
)A summary of the fitted model is obtained via
The meta-analytic mean (i.e., the model intercept) is with a credible interval of . Thus the mean correlation across species is positive according to the model.
Suppose that we analyze a phenotype that consists of counts instead of being a continuous variable. In such a case, the normality assumption will likely not be justified and it is recommended to use a distribution explicitly suited for count data, for instance the Poisson distribution. The following data set (again retrieved from mpcm-evolution.org) provides an example.
data_pois <- read.table(
"https://paul-buerkner.github.io/data/data_pois.txt",
header = TRUE
)
data_pois$obs <- 1:nrow(data_pois)
head(data_pois)As the Poisson distribution does not have a natural overdispersion
parameter, we model the residual variance via the group-level effects of
obs (e.g., see Lawless, 1987).
model_pois <- brm(
phen_pois ~ cofactor + (1|gr(phylo, cov = A)) + (1|obs),
data = data_pois, family = poisson("log"),
data2 = list(A = A),
chains = 2, cores = 2, iter = 4000,
control = list(adapt_delta = 0.95)
)Again, we obtain a summary of the fitted model via
summary(model_pois)
plot(conditional_effects(model_pois), points = TRUE)Now, assume we ignore the fact that the phenotype is count data and fit a linear normal model instead.
model_normal <- brm(
phen_pois ~ cofactor + (1|gr(phylo, cov = A)),
data = data_pois, family = gaussian(),
data2 = list(A = A),
chains = 2, cores = 2, iter = 4000,
control = list(adapt_delta = 0.95)
)
summary(model_normal)We see that cofactor has a positive relationship with
the phenotype in both models. One should keep in mind, though, that the
estimates of the Poisson model are on the log-scale, as we applied the
canonical log-link function in this example. Therefore, estimates are
not comparable to a linear normal model even if applied to the same
data. What we can compare, however, is the model fit, for instance
graphically via posterior predictive checks.
Apparently, the distribution of the phenotype predicted by the Poisson model resembles the original distribution of the phenotype pretty closely, while the normal models fails to do so. We can also apply leave-one-out cross-validation for direct numerical comparison of model fit.
loo(model_pois, model_normal)Since smaller values of loo indicate better fit, it is again evident
that the Poisson model fits the data better than the normal model. Of
course, the Poisson model is not the only reasonable option here. For
instance, you could use a negative binomial model (via family
negative_binomial), which already contains an
overdispersion parameter so that modeling a varying intercept of
obs becomes obsolete.
In the above examples, we have only used a single group-level effect (i.e., a varying intercept) for the phylogenetic grouping factors. In brms, it is also possible to estimate multiple group-level effects (e.g., a varying intercept and a varying slope) for these grouping factors. However, it requires repeatedly computing Kronecker products of covariance matrices while fitting the model. This will be very slow especially when the grouping factors have many levels and matrices are thus large.
de Villemeruil P. & Nakagawa, S. (2014) General quantitative genetic methods for comparative biology. In: Modern phylogenetic comparative methods and their application in evolutionary biology: concepts and practice (ed. Garamszegi L.) Springer, New York. pp. 287-303.
Hadfield, J. D. & Nakagawa, S. (2010) General quantitative genetic methods for comparative biology: phylogenies, taxonomies, and multi-trait models for continuous and categorical characters. Journal of Evolutionary Biology. 23. 494-508.
Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. Canadian Journal of Statistics, 15(3), 209-225.