Linear mixed models (LMMs) have become widely used for dealing with population structure in human GWAS, and they’re becoming increasing important for QTL mapping in model organisms, particularly for the analysis of advanced intercross lines (AIL), which often exhibit variation in the relationships among individuals.

In my efforts on R/qtl2, a reimplementation R/qtl to better handle high-dimensional data and more complex cross designs, it was clear that I’d need to figure out LMMs. But while papers explaining the fit of LMMs seem quite explicit and clear, I’d never quite turned the corner to actually seeing how I’d implement it. In both reading papers and studying code (e.g., lme4), I’d be going along fine and then get completely lost part-way through.

But I now finally understand LMMs, or at least a particular, simple LMM, and I’ve been able to write an implementation: the R package lmmlite.

It seemed worthwhile to write down some of the details.

The model I want to fit is *y = X b + e*, where var(*e*) = *sK + tI*, where *K* is a known kinship matrix and *I* is the identity matrix. Think of *y* as a vector of phenotypes and *X* as a matrix of covariates. Let *v = s+t* be the residual variance, and let *h = s/(s+t) = s/v* be the heritability.

First, a shout to Artem Tarasov, who wrote a series of blog posts walking through and explaining the source code for FaST-LMM and pylmm, and to Nick Furlotte, whose pylmm code is especially clear and easy-to-read. Only by reading their work did I come to understand these LMMs.

Back to the model fit:

- For a fixed value of the heritability,
*h*, we have var(*e*) =*v[hK + (1-h)I] = vV*where*V*is known. And so we end up with a general least squares problem, which we can fit in order to estimate*b*and*v*. - And actually, if you take the eigen decomposition of
*K*, say*K = UDU’*, it turns out that you can write*hK + (1-h)I = hUDU’ + (1-h)UU’ = U[hD + (1-h)I]U’*. That is, the eigenvectors of*K*are the same as the eigenvectors of*hK + (1-h)I*. And so if you pre-multiply*y*and*X*by*U’*, you end up with a weighted least squares problem, which is way faster to fit than a general least squares problem. - Having fit the weighted least squares problem to estimate
*b*and*v*, you can then calculate the corresponding log likelihood (or, better, the restricted log likelihood, if you want to do REML). - You’re then left with a one-dimensional optimization problem (optimizing the log likelihood over
*h*), which you can solve by Brent’s method. - That’s it!

It seems quite obvious in retrospect. It’s a bit embarrassing that it’s taken me so long to come to this understanding.

In lmmlite, I implemented this algorithm (closely following the code in pylmm) twice: in plain R, and then in C++ (using RcppEigen, which is an interface to the Eigen C++ linear algebra library). The plain R code is a bit slower then pylmm; the C++ code is a bit faster. In the C++ code, almost all of the computation time is devoted to the eigen decomposition of the kinship matrix. Once that’s done, the rest is super quick.

Tags: mixed models, programming, QTL mapping, R

25 Nov 2015 at 11:22 am |

[…] Fitting linear mixed models for QTL mapping Linear mixed models (LMMs) have become widely used for dealing with population structure in human GWAS, and they’re becoming increasing important for QTL mapping in model organisms, particularly for the analysis of advanced intercross lines (AIL), which often exhibit variation in the relationships among individuals. […]

7 Apr 2016 at 11:13 pm |

Karl,

I’m not clear how the eigendecomposition saves time. Doesn’t it take about as long as the general least squares problem (and longer if V has a sparse inverse to speed up GLS)?

I can see it would help for multiple phenotypes or multiple sets of covariates where you could re-use the same eigenvectors.

7 Apr 2016 at 11:45 pm |

Yes; the advantage is for the case where you’re doing the same thing with multiple phenotypes and especially multiple possible covariates (genetic markers across the genome) using the same eigenvectors.

8 Apr 2016 at 4:58 am |

Ok, that makes sense. I’m used to multiple markers being done with score tests rather than really fitting the model, but that’s in humans where the associations tend to be really weak.

18 Nov 2016 at 3:09 pm |

I was thinking of taking your C++ code for LMM (I think available on GitHub) to create a MEX file in matlab as I do most of my coding there. Do you think there will be any obvious issues with this? I too have population structure issues (lots of sibs in sudo-AIL cross). Many thanks for your description – very helpful.

18 Nov 2016 at 4:35 pm |

That would be great; you just need to pay attention to the requirements of the license (GNU Affero GPL), which largely means distributing the source and not just compiled code, and having your code also released under Affero GPL. I’m not totally clear on the “Affero” part, but I understand it to mean that if the code is made available via some web server, the source needs to be made accessible, too.

18 Nov 2016 at 4:36 pm |

(But note that the lmmlite code is more of a “toy”, written just for me to polish my understanding of how to fit these particular LMMs.)