Archive for the ‘Genetics’ Category

Fitting linear mixed models for QTL mapping

24 Nov 2015

Linear mixed models (LMMs) have become widely used for dealing with population structure in human GWAS, and they’re becoming increasingly important for QTL mapping in model organisms, particularly for the analysis of advanced intercross lines (AIL), which often exhibit variation in the relationships among individuals.

In my efforts on R/qtl2, a reimplementation of R/qtl to better handle high-dimensional data and more complex cross designs, it was clear that I’d need to figure out LMMs. But while papers explaining the fit of LMMs seem quite explicit and clear, I’d never quite turned the corner to actually seeing how I’d implement it. In both reading papers and studying code (e.g., lme4), I’d be going along fine and then get completely lost part-way through.

But I now finally understand LMMs, or at least a particular, simple LMM, and I’ve been able to write an implementation: the R package lmmlite.

It seemed worthwhile to write down some of the details.


“My” chromosome 8p inversion

8 May 2013

There was lots of discussion on twitter yesterday about Graham Coop’s paper with Peter Ralph (or vice versa), on The geography of recent genetic ancestry across Europe, particularly regarding the FAQ they’d created.

I was eager to take a look, and, it’s slightly embarrassing to say, I first did a search to see if they’d made a connection to any of my work. (I’m probably not the only one to do that.) Sure enough, they cited a paper of mine, but it was Giglio et al. (2001) Am J Hum Genet 68: 874–883, on “my” chr 8p inversion, and not what I’d expected, my autozygosity paper.

What did the chr 8p inversion have to do with this? Search for “[36]” and you’ll find:

We find that the local density of IBD blocks of all lengths is relatively constant across the genome, but in certain regions the length distribution is systematically perturbed (see Figure S1), including around certain centromeres and the large inversion on chromosome 8 [36], also seen by [35].

The chr 8p inversion presents an interesting data analysis story from my postdoc years. In a nutshell: I was studying human crossover interference and found poor model fit for maternal chr 8. The poor fit was due to tight apparent triple-crossovers in two individuals in each of two families. I hypothesized that there was an inversion in the region, though it would have to be long and have both orientations common. The inversion was confirmed via FISH, and it’s something like 5 Mbp long, with the frequencies of the two orientations being 40 and 60% in people of European ancestry.


Interactive eQTL plot with d3.js

6 Mar 2013

I just finished an interactive eQTL plot using D3, in preparation for my talk on interactive graphics at the ENAR meeting next week.

Static view of interactive eQTL plot

The code (in CoffeeScript) is available at github. But beware: it’s pretty awful.

The hardest part was setting up the data files. Well, that plus the fact that I just barely know what I’m doing in D3.

The future of personalized medicine

11 Oct 2012

“Scientists” ripping people off by selling them basically useless genetic information with a bullshit report: a genetic test for exercise.

Odd smoother

27 Sep 2012

I was flipping through the contents of G3 and saw this paper, which includes Jim Cheverud as an author.

I poked through the paper, and my eye was attracted to the following figure. These are supposed to be functions, but some weird smoother was used, and the results are not functions.


17 May 2012

I’m frequently picking pieces of paper up off the floor — bits of the kids’ artwork. Today I found one with this:

That’s actually mine, so I thought I’d share it. Here’s a link to the full page.

These scratchings are related to a bit of theory I’ve been working on (with considerable help), regarding phylogenetic trees. It’s part of this paper, which I’m getting ready to re-submit.

Genetic testing of sperm donors

16 May 2012

A recent New York Times article discusses the potential need for genetic screening of sperm donors. A couple of points struck me as rather odd.

One, the mention of fragile X syndrome, has been corrected:

An earlier version of this article erroneously included fragile X syndrome in a list of genetic diseases that have affected children conceived with donated sperm. (Although a sperm donor who is a carrier of the fragile X premutation may pass it along to his daughters, the disease, which affects twice as many boys as girls, is almost always inherited from the mother.)

The other:

Sperm donors are no more likely to carry genetic diseases than anybody else, but they can father a far greater number of children: 50, 100 or even 150, each a potential inheritor of flawed genes, and each a vector for making those genes more pervasive in the general population.

But the frequency of offspring of sperm donors with genetic diseases would be the same whether there were many more donors with one child each or fewer donors with many children each. So it’s not clear how sperm donors fathering many children is an argument for genetic testing, except that with fewer donors for the same number of offspring, one would not need to perform as many genetic tests, and so there would be a considerable cost savings.

The article led with a story about a couple whose child, fathered by a sperm donor, has cystic fibrosis. But the chance of that is the same whether the sperm came from a donor or in the usual way.

Rob Tibshirani and Andy Clark named to NAS

3 May 2012

Rob Tibshirani and Andy Clark are now members of the National Academy of Sciences.

Microarrays suck

25 Apr 2012

Maybe it’s just that I’m stupid and haven’t been paying proper attention to Rafa’s work in the past decade, but microarray data have really been kicking my ass the last few weeks. Here are a few of the lessons that I’ve re-learned.

Lesson 1: What the hell is that?

Many (or all) of my interesting findings have been completely fortuitous results of exploring apparent artifacts in data (autozygosity in the CEPH families, a large, common human inversion polymorphism, and most recently sample mixups).

A couple of weeks ago, I was working on writing up a paper, and saw an unusual plot, and thought, “Hmm, that’s interesting,” and so I made some more plots. And then I saw this:

This is an image of the correlations between samples, across genes, for a study with 500 microarrays on each of 6 tissues. The samples are sorted by sex (female on bottom left; male on top right) and then by ID. I was initially puzzling over the sex difference in other tissues (same-sex pairs are positively correlated; opposite-sex pairs are negatively correlated), but when I saw this tissue, I was struck by the plaid pattern.

Following up on this, it seems that the plaid pattern is the result of a set of bad arrays that should have been but hadn’t been detected before.
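For anyone wanting to make this kind of image, the underlying matrix could be computed roughly as in the sketch below. This is not the code used for the figure: the function names, the data layout (dictionaries keyed by sample ID), and the toy expression values are all made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def sample_correlations(expr, sex):
    """Correlations between samples, across genes.

    expr: dict of sample ID -> list of expression values (same gene order)
    sex:  dict of sample ID -> "F" or "M"
    Samples are ordered by sex ("F" sorts before "M") and then by ID.
    Returns the ordered IDs and the correlation matrix as a list of rows.
    """
    ids = sorted(expr, key=lambda s: (sex[s], s))
    corr = [[pearson(expr[a], expr[b]) for b in ids] for a in ids]
    return ids, corr


# Toy data (hypothetical): 4 samples, 4 genes
expr = {
    "m2": [1.0, 2.0, 3.0, 4.0],
    "f1": [4.0, 3.0, 2.0, 1.0],
    "m1": [1.1, 2.1, 2.9, 4.2],
    "f2": [3.9, 3.2, 1.8, 1.1],
}
sex = {"m1": "M", "m2": "M", "f1": "F", "f2": "F"}

ids, corr = sample_correlations(expr, sex)
for sample_id, row in zip(ids, corr):
    print(sample_id, [round(r, 2) for r in row])
```

Drawing `corr` as a heat map, with rows and columns in the order given by `ids`, gives an image in which same-sex and opposite-sex blocks can be compared directly.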

Lesson 2: Look

Of course, I wouldn’t have found this array problem if I hadn’t looked. And in my several years working on this project, I hadn’t looked.

I was overwhelmed by the 500 microarrays × 6 tissues × 40,000 probes, and so didn’t make the sorts of plots that I should have.

Plus, this is the first real microarray project I’ve been involved in, and I haven’t known what to look for. (I should have looked at images of correlation matrices with arrays sorted by similarity as determined from hierarchical clustering.)

You can’t make 3000 choose 2 scatterplots, but you can look at summaries of the marginal distributions of the 3000 arrays. I was thinking about how to depict 500 box plots, and came up with this:

There are ~500 arrays here, with the lines indicating quantiles: the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles. The first batch of arrays are the bad ones. You can see that those have a shift upward in median but also a heavy lower tail.
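The idea behind the picture can be sketched in a few lines of Python. This is not the R/broman code: the function names are hypothetical and the data are simulated, with a few "bad" arrays shifted upward just to have something to look at.

```python
import random
from statistics import quantiles

PERCENTS = [1, 5, 10, 25, 50, 75, 90, 95, 99]


def array_quantiles(values):
    """Return the 1st, 5th, ..., 99th percentiles of one array's values."""
    # n=100 gives the 99 cut points at 1%, 2%, ..., 99%
    cuts = quantiles(values, n=100, method="inclusive")
    return [cuts[p - 1] for p in PERCENTS]


def quantile_curves(arrays):
    """Summarize many arrays by their quantiles.

    arrays: list of lists of numbers, one inner list per array.
    Returns a dict of percent -> list with one value per array;
    each of these lists is one line in the plot.
    """
    per_array = [array_quantiles(a) for a in arrays]
    return {p: [q[i] for q in per_array] for i, p in enumerate(PERCENTS)}


random.seed(1)
# Simulated data: 5 "bad" arrays shifted upward, then 15 ordinary ones
arrays = [[random.gauss(0.5, 1) for _ in range(1000)] for _ in range(5)]
arrays += [[random.gauss(0.0, 1) for _ in range(1000)] for _ in range(15)]

curves = quantile_curves(arrays)
print([round(m, 2) for m in curves[50]])  # medians, one per array
```

Plotting each list in `curves` against the array index, one line per percentile, gives the same kind of display: nine lines in place of hundreds of box plots.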

I quite like the picture.

The code is available within my R/broman package on github, though there’s not much to it.

Lesson 3: Don’t trust anyone

Don’t even trust yourself.

I relied on others to do the data cleaning. I should have checked things more carefully.

I looked at the data quite closely a year or so ago and detected sample mix-ups. In correcting those mistakes, I spent a lot of time looking at the array data and trying to identify bad arrays. Clearly I didn’t look closely enough, or I just looked at the wrong things.

Lesson 4: Take your time

Scientists are under a lot of pressure to produce and often are not exploring data as thoroughly as the data deserve.

But if you rush through data diagnostics, you could embarrass yourself later.

More importantly, take the time to follow up on apparent artifacts. For me, the unintended analyses are more interesting than the planned ones.

My basic strategy for data diagnostics

  • Expect errors
  • Think about what might have gone wrong and how such problems might be revealed
  • Make lots of specially-tailored plots
  • Create a (potentially informal) pipeline of all of the useful diagnostics you’ve used in the past
  • Make more plots just because you think they might be interesting
  • If something looks odd, figure out why

Collaborative Cross papers

16 Feb 2012

A batch of 15 papers on the mouse Collaborative Cross appeared in Genetics and G3 today, including two of my own: