Archive for April, 2012

Positive comments on peer review

27 Apr 2012

We all complain about peer review, particularly when our best work is rejected by every journal from Nature Genetics down to that journal that will publish anything, so that it finally appears in a volume honoring some guy, which only he will read.

However, sometimes an anonymous reviewer will identify an important flaw in a paper that you can fix before it’s published, thus saving you from potential public embarrassment.

That happened to me again today. I finally got the reviews back for a paper, eight weeks after it was submitted. I had become a bit impatient, but one of the reviewers identified a hole in our theory section, which we can now fix before publication (I think we figured it out this afternoon), thus avoiding public embarrassment, except for the fact that I’m currently pointing it out publicly.

Complaints about the peer review process are not unlike a common complaint about statisticians: that we are a barrier to scientists publishing what they know to be true. That is sometimes the case, but at other times, both reviewers and statisticians can help you to avoid embarrassing yourself.

Microarrays suck

25 Apr 2012

Maybe it’s just that I’m stupid and haven’t been paying proper attention to Rafa’s work in the past decade, but microarray data have really been kicking my ass the last few weeks. Here are a few of the lessons that I’ve re-learned.

Lesson 1: What the hell is that?

Many (or all) of my interesting findings have been completely fortuitous results of exploring apparent artifacts in data (autozygosity in the CEPH families, a large, common human inversion polymorphism, and most recently sample mixups).

A couple of weeks ago, I was working on writing up a paper, and saw an unusual plot, and thought, “Hmm, that’s interesting,” and so I made some more plots. And then I saw this:

This is an image of the correlations between samples, across genes, for a study with 500 microarrays on each of 6 tissues. The samples are sorted by sex (female on bottom left; male on top right) and then by ID. I was initially puzzling over the sex difference in other tissues (same-sex pairs are positively correlated; opposite-sex pairs are negatively correlated), but when I saw this tissue, I was struck by the plaid pattern.

Following up on this, it seems that the plaid pattern is the result of a set of bad arrays that should have been detected before but hadn't been.
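The between-sample correlation image behind that plaid pattern is easy to reproduce in principle. Here is a minimal sketch in Python (the actual analysis was in R, and these data are simulated; the array and probe counts are toy sizes, not the 500 × 40,000 from the study): a shared artifact in a batch of arrays makes those arrays correlate with each other, producing a block in the image.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy expression matrix: rows = genes/probes, columns = samples/arrays.
n_genes, n_samples = 200, 30
expr = rng.normal(size=(n_genes, n_samples))

# Simulate a batch of "bad" arrays sharing an artifact: a common
# gene-specific shift on a subset of genes, so the bad arrays
# correlate with one another and form a plaid block.
bad = np.arange(5)
expr[:50, bad] += rng.normal(size=(50, 1)) * 2

# Between-sample correlations, computed across genes.  np.corrcoef
# treats rows as variables, so transpose to make each sample a variable.
cor = np.corrcoef(expr.T)  # shape (n_samples, n_samples)

# In practice you would display this with an image plot (e.g.
# plt.imshow(cor)) after sorting the samples by sex and then ID.
print(cor.shape)
```

The key point is that the correlations are taken across genes, so any pair of arrays sharing a technical artifact stands out as an unusually correlated block.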

Lesson 2: Look

Of course, I wouldn’t have found this array problem if I hadn’t looked. And in my several years working on this project, I hadn’t looked.

I was overwhelmed by the 500 microarrays × 6 tissues × 40,000 probes, and so didn’t make the sorts of plots that I should have.

Plus, this is the first real microarray project I’ve been involved in, and I haven’t known what to look for. (I should have looked at images of correlation matrices with arrays sorted by similarity as determined from hierarchical clustering.)
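To sketch what "arrays sorted by similarity as determined from hierarchical clustering" means in code (again a Python illustration on simulated data, not the original R analysis): convert the between-array correlations to distances, cluster, and use the dendrogram's leaf order to reorder the correlation image so block structure becomes visible.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)

# Toy between-array correlation matrix with two hidden groups of
# arrays, scrambled so the structure isn't visible in the raw order.
n = 20
group = rng.permutation(np.repeat([0, 1], n // 2))
cor = np.where(group[:, None] == group[None, :], 0.8, 0.1)
np.fill_diagonal(cor, 1.0)

# Turn correlation into a distance and cluster the arrays.
dist = 1.0 - cor
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

# leaves_list gives the dendrogram's left-to-right leaf order;
# re-plotting cor[order][:, order] makes the blocks contiguous.
order = leaves_list(Z)
print(order)
```

With real array data you would compute `cor` across probes first (as in the plaid-pattern image above) and then image the reordered matrix.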

You can’t make 3000 choose 2 scatterplots, but you can look at summaries of the marginal distributions of the 3000 arrays. I was thinking about how to depict 500 box plots, and came up with this:

There are ~500 arrays here, with the lines indicating the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles. The first batch of arrays are the bad ones. You can see that those have an upward shift in median but also a heavy lower tail.

I quite like the picture.

The code is available within my R/broman package on GitHub, though there's not much to it.
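The underlying computation is simple: one row of quantiles per array, with each quantile then drawn as a line across arrays. Here is a rough Python sketch of that idea on simulated data (the real version is the R code in R/broman; the bad-batch behavior, an upward median shift plus a heavy lower tail, is built into the simulation to mimic the figure):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy intensity data: rows = probes, columns = arrays.  The first
# batch of arrays is "bad": median shifted up, heavy lower tail.
n_probes, n_arrays = 5000, 100
x = rng.normal(loc=8.0, scale=1.0, size=(n_probes, n_arrays))
n_bad = 10
x[:, :n_bad] += 0.5                              # upward shift
tail = rng.random((n_probes, n_bad)) < 0.05      # 5% of values...
x[:, :n_bad][tail] -= rng.exponential(4.0, size=tail.sum())  # ...dragged low

# One row of quantiles per array; plotting each column of q as a
# line across arrays gives the boxplot substitute described above.
probs = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]
q = np.quantile(x, probs, axis=0).T              # shape (n_arrays, 9)
print(q.shape)
```

This scales where 500 side-by-side boxplots wouldn't: the quantile lines stay readable no matter how many arrays you have.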

Lesson 3: Don’t trust anyone

Don’t even trust yourself.

I relied on others to do the data cleaning. I should have checked things more carefully.

I looked at the data quite closely a year or so ago and detected sample mix-ups. In correcting those mistakes, I spent a lot of time looking at the array data and trying to identify bad arrays. Clearly I didn’t look closely enough, or I just looked at the wrong things.

Lesson 4: Take your time

Scientists are under a lot of pressure to produce and often are not exploring data as thoroughly as the data deserve.

But if you rush through data diagnostics, you could embarrass yourself later.

More importantly, take the time to follow up on apparent artifacts. For me, the unintended analyses are more interesting than the planned ones.

My basic strategy for data diagnostics

  • Expect errors
  • Think about what might have gone wrong and how such problems might be revealed
  • Make lots of specially-tailored plots
  • Create a (potentially informal) pipeline of all of the useful diagnostics you’ve used in the past
  • Make more plots just because you think they might be interesting
  • If something looks odd, figure out why

ENAR 2012

18 Apr 2012

The annual ENAR meeting (of the Eastern North American Region of the International Biometric Society) was a few weeks ago in Washington, DC. It was great to see old friends, and I learned a number of things outside of the sessions, but mostly I was annoyed by the meeting.

Ways to annoy me

  • Distorted aspect ratios: I’m often seeing LCD projectors set up in wide format, stretching a presentation that was developed in the old 4:3 style. Is it not obvious that preserving the aspect ratio is more important than filling the screen?
    • This is especially bad with photographs:


    • It also makes fonts ugly:


  • Outlines for talks (especially 15 min talks): If you have only 15 min, don’t waste any time telling us what you’re going to be telling us; just tell us. And even if you’ve got 45 min, I find the “Background, Methods, Simulations, Application, Discussion” outline totally useless, and anything more complicated than that seldom makes sense until the speaker is part way through the talk. Why try to explain terminology before you’ve gotten to the background section?
  • “I’m running out of time so I’m going to skip the real data analysis and go quickly through some asymptotic results.” Someone actually said that.
  • Opening night poster session (and for three hours!): It actually seemed to be working, but I sure wouldn't want to be standing next to a poster from 8 to 11pm. I would prefer:
    • posters available to look at throughout the meeting
    • multiple poster sessions (so presenters have some opportunity to talk to each other)
    • nothing happening at the time of a poster session
  • Dull talks by famous people: (I’m not talking about you, fine reader, but the other famous people.) Biology meetings will have a few invited speakers but the bulk of the talks will be chosen based on submitted abstracts. Statistics meetings seem more often arranged to have people submit proposals for full sessions, with a slate of pre-selected speakers, and it is those sessions that are reviewed. That seems the perfect design if you want crappy talks by famous people. Perhaps I’m annoyed only because I’m a non-famous person who can give a good talk.