Mind the Gap: Mixed Models in R
What should be done about the sprawling mixed model ecosystem in R?
This is adapted from a talk I gave at NCCC-170 annual meeting June 16, 2022.
0.1 Overview of the Problem
The mixed model situation in R is a hot mess (that is, it is spectacularly disordered).
The linear mixed model ecosystem in R consists of over 80 libraries that either construct and solve mixed model equations or helper packages the process the results from mixed model analysis (this list was derived from a similar list created by Ben Bolker and others). These libraries provide a patchwork of overlapping and unique functionality regarding the fundamental structure of mixed models: allowable distributions, nested and crossed random effects, heterogeneous error structures and other facets. No single library has all possible functionality enabled.
This patchwork of packages makes it very challenging for statisticians to conduct mixed model analysis and to teach others how to run mixed models in R. As a statistical consultant, my job is to help agricultural researchers develop statistical models that match their research aims and implement those models in R or another statistical software. My experience after 4 year of consulting is that many researchers have a hard time time sorting out how to correctly specify their statistical model in R.
Please note that none of this is intended as a criticism of current or past package authors. Package authors and maintainers give a tremendous amount of time and effort - often for free - to provide the functionality present in a package. It is not the fault of any individual that the R mixed model ecosystem is so challenging. It is likely more a reflection of how challenging mixed models are to properly specify and implement and the overall flexibility of these models.
But alas, here we are. Let’s explore the specific problems with running mixed models in R.
0.2 The Issues
(arranged in order how of much grief I think each causes for R users)
Problem: dispersed functionality
Table of a some example functionality and the packages where it is enabled. This is not a comprehensive list of possible functionality in mixed models or what it implemented in R.
Package | Functionality |
---|---|
lme4 | crossed random effects |
nlme | exp, gaus, lin & pow covariance structures |
spaMM | Matérn covariance structure |
lme4 | generalised LMM |
nlme | compound symmetry, AR1, ARMA covariance structures |
spATS | spline-based spatial effects |
lmerGS, lme4qtl, rrBLUP, pedigreemm, asreml, sommer, … | kinship relationship matrix (covariance structure) |
The most popular mixed model package is likely lme4, perhaps followed by nlme (it’s difficult to quantify this because downloads don’t always reflect usage, and since nlme is now part of base R, download stats for that are really wonky).
These packages by themselves can good aims, but the problems arise in complex situations that require combining functionality. While lme4 is very powerful (and fast!), it lacks essential functionality for including spatial covariates. nlme does have that functionality (although not for Matérn, arguable one of the most popular spatial covariance structures), but it is difficult (but not impossible) to specify crossed random effect (they are nested by default in lme()
calls). There are several packages that extend lme4 but to permit inclusion of a kinship based covariance matrix, but combining that with spatial covariates is not possible. And of course, lme4 famously does not compute p-values, a limitation that can be overcome by computing them manually (fun!!!) or using lmerTest. But, often users can only use one lme4 package extender at a time, leaving them to choose which functionality they need most. And while I agree with the spirit of why p-values are not included in the default ANOVA output for lme4, this is rarely a persuasive argument for my clients, who want p-values come hell or high water.
Problem: inconsistent syntax
Inconsistent syntax is a long standing problem with R and mixed models or any modelling package in R has this problem (and for all those who say “use Tidymodels!” - agreed, and this is addressed later in the post).
Imagine we want to construct a linear mixed model for one fixed effect, x
, and one random effect, rep
, for a dependent variable, y
. These variables are in a data.frame called “mydata” where y
and x
are numeric, while rep
is an unordered factor.
The model:
\[y_{ij} = x_i + rep_j + \epsilon_{ij}\]
where the error terms, \(\epsilon_{ij}\) are distributed as thus:
\[\epsilon_{ij} \sim N(0, \sigma_{error})\] and \(rep_j\) is distributed as:
\[rep_j \sim N(0, \sigma_{rep})\]
Below are syntax examples for this model in 4 major mixed modelling packages in R.
lme4
Perhaps the most widely used package for mixed modelling in R. Considered (by me) to be the successor to nlme.
lme4::lmer(y ~ x + (1|rep), data = mydata,
na.action = na.exclude, REML = TRUE)
nlme
Standard mixed modelling package in R, now part of base R and still widely used (spoiler: I love this package!)
nlme::lme(y ~ x, random = ~ 1|rep, data = mydata,
na.action = na.exclude, method = “REML”)
This follow very similar syntax to lme4, but the more arguments specified/the more complex things get, the more the syntax diverges (e.g. how weights matrices are specified is rather different between the packages).
asreml
A very popular and widely used software in agricultural sciences. It uses a formula interface similar to nlme and lme4, with slightly different approaches to specifying how to handle missing data and special covariance matrices. The package sommer uses a very similar syntax style. Note that asreml is not open source and requires an annual license to use.
asreml::asreml(y ~ x, random = ~ rep, data = mydata,
na.method.X = 'exclude',
na.method.Y = ‘exclude’)
rrBLUP
A popular package in the plant breeding community. This package does not use a formula object, and instead requires users to specify the independent effects as a design matrix. The other arguments do not follow existing conventions for linear models in R.
rrBLUP::mixed.solve(y = mydata$y, x = {fixed_design_matrix},
z = {random_design_matrix}, SE = TRUE)
I talk more about Tidymodels below, but if you just can’t wait, tldr: I have not observed many of my clients or other agricultural scientists using Tidymodels for mixed modelling and even if they did, Tidymodels only addresses a very small number of packages.
Problem: varied output
There are so many special rules to follow.
- Use lmerTest to output p-values for lme4 (but this doesn’t work with lme4GS or other add-on packages).
- emmeans can extract fixed effects, but not BLUPs! (But, marginaleffects can calculate BLUPs! 🤩🤩🤩) Instead, check the package for package functions (fingers crossed that BLUP calculation is enabled!) and deal with the wildly varying syntax.
- Not all these packages have methods written to extract output, so you have to hunt through the model object or the summary of the model object. This is annoying (and for many, it can be very mystifying) and results in very brittle code.
- The AR1xAR1 covariance structure (kronecker product of AR1 in two dimensions, each with its own \(\rho\)) is implemented in sommer, but it does not estimate \(\rho\); breedR can do that (but it’s not open source); INLA can estimate \(\rho\), but the package uses a Bayesian approach and takes awhile to install and learn.
- …and so much more.
Unsurprisingly, people have a hard time navigating this!
So, users end up implementing sub-optimal solutions (e.g. running aov()
, equivalent to PROC ANOVA
in SAS, when a mixed model is needed). I can (usually) help them, but is this the right long-term solution? Once these clients leave my office, are they ready to implement different mixed models on their own? And what about the people who never consult with a statistician? Mixed models are a cornerstone of agricultural research (and likely academic research overall), yet the tools available to do this correctly are insufficient.
0.3 Existing Solutions
There have been recent efforts to create a common function call for mixed models with the library multilevelmod (part of the Tidymodels family) and a common way to access accessory information with broom.mixed. These umbrella libraries are enormously helpful, but they are incomplete, addressing less than 10 mixed models packages in total.
multilevelmod package
This is part of the Tidymodels family, whose goal (among many) is to standardize the syntax for mixed modelling in the R-verse. I am so glad this package exists; it has been sorely needed for years and does simplify the syntax. However, it addresses 4 packages and an extremely limited number of scenarios:
package | linear | logistic | Poisson |
---|---|---|---|
lme4 | X | X | X |
nlme | X | ||
gee | X | X | X |
rstanarm | X | X | X |
Again, these package authors are busy, and this task of unifying the mixed model ecosystem is difficult, so no shade intended towards these folks! My point is that implementing a mixed model is still rather challenging from from the user standpoint.
broom.mixed package
This is a fantastic package that supports a large number of packages (see below). The highlighted packages are those also supported by multilevelmod.
- brms
- gamlss
- GLMMadaptive
- glmmADMB
- lme4
- mcmc
- MCMCglmm
- mediation
- nlme
- pscl
- rstanarm
- TMB
multilevelmod and broom.mixed work like their counterparts in the Tidymodels world. I don’t want to go into the syntax, but you can find examples here and here.
Many accessory packages:
There are loads of accessory packages written to ease the burden of extracting output from a mixed model object and/or extending the functionality. These add to the general complexity of specifying a mixed model in R and hence are both a solution and part of the problem.
It’s a big task to name all helper packages for mixed models that I have no interest in doing comprehensively, so here is a short list:
- afex: conduct ANOVA
- emmeans and marginaleffects: 2 helpful packages for extracted BLUE’s and in in the case of marginaleffects, BLUPs.
- lmerTest: get those p-values from lme4!
There are more (see the google sheets links listed above), but wow, keeping track of this is a major undertaking.
Tidymodels style guide
- Naming conventions for arguments
- How information should be returned in ‘predict’ calls, ‘summary’ statements
- what the user-facing function should look like
- usage of S3 method conventions
- how parallel processing occurs
- Tidyverse style guide (opinionated)
Missing from this guide are conventions on the standardized output for (mixed) linear models (log likelihood, residuals, variance components, etc). These things are “missing” because the Tidymodels guide is broadly written for a diverse group of packages/modelling applications in which the standard output from a linear model is not possible or desired. However, in the mixed model landscape, there is common output that users expect; hence it would be useful to codify standards for these types of models.
0.4 What else Is needed?
Unfortunately none of these solutions solve one of the most bedeviling problems: running a complex mixed model in R requires using a large patchwork of packages (many of which are not part of Tidymodels) and in some cases, these different functionalities cannot be implemented together.
0.5 What else can we do?
- Teach Tidymodels to our clients. I admit to not doing this, but I should be doing this.
- Encourage teaching of Tidymodels framework at our institutions.
- Write an opinionated white paper about recommended functionality and style for any new mixed modelling packages written for R (Who wants more things to do??? And will anyone even read this, let alone follow it?)
- Contribute to multilevelmod and broom.mixed to expand the packages supported. Important Note: doing this starts with visiting the respective Github repo’s for these packages and finding out what the package authors & maintainers want.
Questions to ask ourselves:
- Which is the most important barriers we should address: standardized syntax? standardized output? something else?
- What mixed model packages are the most important to the user base?
- Seriously, who is going to do this? Is this grant proposal worthy?
- What about the functionality patchwork problem? What on this green earth should we do about that?
Other ideas? (I want to hear them! DM me on Twitter: @SeedsAndBreeds)
0.6 Final Thoughts
I decided to write this after years of trying to make mixed models work for my clients and often seeing them frustrated. Perhaps our applications seem niche - but, there are thousands of agricultural researchers in the world. And questions about how to build a linear mixed model and conduct ANOVA are by far the most common FAQ I encounter. Common scenarios in this subject include:
- A multilevel field trial (imagine 3-5 treatments, each with 2-10 levels, perhaps one a time series, because ag scientists love those complex experiments!). In many cases, we need to account for spatial variation.
- A field trial with one fixed effect and one random effect, but two different heterogeneous covariance matrices need to be included: one for spatial variation and another for relatedness among the treatments.
- All of the previous experiments plus special designs: split plot, split-split plot, Latin square, incomplete block, augmented design….
- All the previous experiments, but now we have a beta-distributed variable! Or a negative binomial (the experimental variables I see rarely meet the expectations of a Poisson distribution).
- A mixed model with tons of zeros! And maybe a time series component, as well.
There are ways to solve these problems–easily–in other (paid) software such as SAS or asreml-R. But for whatever reasons that I am not in control of, people want to use R and the non-proprietary libraries that accompany it. And often they end up implementing sub-optimal solutions based on a lack of understanding of how to properly specify a mixed model in R.
Should we abandon users to the confusing and incomplete maze of mixed models in open-source R or should we help them and ultimately create a more sustainable mixed model ecosystem?
the R mixed model ecosystem https://t.co/wx5nLM1soc
— ag stats stan (@SeedsAndBreeds) July 3, 2022
A video of this talk can be viewed on youtube.