Posts Tagged ‘papers’

Reproducibility is hard

9 Sep 2015

Reproducibility is hard. It will probably always be hard, because it’s hard keeping things organized.

I recently had a paper accepted at G3, concerning a huge set of sample mix-ups in a large eQTL study. I’d discovered and worked out the issue back in December, 2010. I gave a talk about it at the Mouse Genetics meeting in Washington, DC, in June, 2011. But for reasons that I will leave unexplained, I didn’t write it up until much later. I did the bulk of the writing in October, 2012, but it wasn’t until February, 2014, that I posted a preprint at arXiv, which I then finally submitted to G3 in June this year.

In writing up the paper in late 2012, I re-did my entire analysis from scratch, to make the whole thing more cleanly reproducible. So with the paper now in press, I’ve placed all of that in a GitHub repository, but as it turned out, there was still a lot more to do. (You can tell, from the repository, that this is an old project, because there are a couple of Perl scripts in there. I long ago switched from Perl to Python and Ruby. I still can’t commit to just one of Python or Ruby…I want to stick with Python, as everyone else is using it, but I much prefer Ruby.)

The basic issue is that the raw data are about 1 GB. The clean version of the data is another 1 GB. And then there are the results of various intermediate calculations, some of them rather slow to compute, which take up another 100 MB. I can’t reasonably put all of that within the GitHub repository.

Both the raw and clean data have been posted in the Mouse Phenome Database. (Thanks to Petr Simecek, Gary Churchill, Molly Bogue, and Elissa Chesler for that!) But the data are posted in a form that I thought would be suitable for others, not quite the form in which I’d actually been using them.

So, I needed to write a script that would grab the data files from MPD and reorganize them in the way that I’d been using them.
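Such a script can be quite small. Here’s a minimal sketch in Python (the URL, file name, and column layout are hypothetical, just for illustration; the actual MPD files are laid out differently), which downloads a file if there isn’t already a local copy and reorganizes long-format records into a wide, one-row-per-sample layout:

```python
import os
import urllib.request

# Hypothetical URL and file name, for illustration only.
MPD_URL = "https://phenome.jax.org/somepath/pheno.csv"

def fetch(url, dest):
    """Download a data file unless a local copy already exists."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)

def long_to_wide(records):
    """Reorganize (sample, trait, value) records into one dict per sample."""
    wide = {}
    for sample, trait, value in records:
        wide.setdefault(sample, {})[trait] = value
    return wide
```

For example, `long_to_wide([("A", "weight", 31.2), ("A", "glucose", 140.0)])` gives `{"A": {"weight": 31.2, "glucose": 140.0}}`. Checking for an existing local copy before downloading matters in a make-driven workflow, so the pipeline doesn’t re-fetch a gigabyte of data on every run.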

In working on that, I discovered some mistakes in the data posted to MPD: there were a couple of bugs in my code to convert the data from the format I was using into the format I was going to post. (So it was good to spend the time on the script that did the reverse!)

In addition to the raw and clean data on MPD, I posted a zip file with the 110 MB of intermediate results on figshare.

In the end, I’m hoping that one can clone the GitHub repository, type make, and have it download the data and run the full analysis. If you want to save some time, you could download the zip file from figshare, unzip it, and then run make.

I’m not quite there, but I think I’m close.

Aspects I’m happy with

For the most part, my work on this project wasn’t terrible.

  • I wrote an R package, R/lineup, with the main analysis methods.
  • I re-derived the entire analysis cleanly, in a separate, reproducible document (using AsciiDoc and knitr), which was a good thing.
  • The code for the figures and tables is all reasonably clean, and draws from either the original data files or from intermediate calculations produced by the AsciiDoc document.
  • I automated everything with GNU Make.
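The core idea behind R/lineup can be sketched in a few lines: correlate each sample in one data set against every sample in a second data set, and flag samples whose best match isn’t themselves. This is a toy Python illustration of that idea, not the package’s actual interface, and the data are made up:

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def best_matches(x, y):
    """For each sample in x, the id of the sample in y it correlates best with."""
    return {sx: max(y, key=lambda sy: pearson(vx, y[sy])) for sx, vx in x.items()}

# Toy data: samples "B" and "C" have been swapped in the second data set.
x = {"A": [1.0, 2.0, 3.0, 4.0], "B": [4.0, 1.0, 3.0, 2.0], "C": [2.0, 4.0, 1.0, 3.0]}
y = {"A": [1.1, 2.0, 2.9, 4.2], "B": [2.1, 3.9, 1.2, 3.0], "C": [3.8, 1.1, 3.1, 1.9]}
mismatches = {s for s, m in best_matches(x, y).items() if s != m}  # {"B", "C"}
```

A self-match appearing as the best correlation for every sample is what you hope to see; the swapped pair shows up immediately as two samples whose best matches are each other.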

What should I have done differently?

There was a lot of after-the-fact work that I would rather not have had to do.

Making a project reproducible is easier if the data aren’t that large and so can be bundled into the GitHub repository with all of the code.

With a larger data set, I guess the thing to do is recognize, from the start, that the data are going to be sitting elsewhere. So then, I think one should organize the data in the form that you expect to be made public, and work from those files.

When you write a script to convert data from one form to another, also write some tests, to make sure that it worked correctly.
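A cheap but effective test is a round trip: convert the internal form to the public form and back, then check that you recover exactly what you started with. A sketch, with formats and names made up for illustration:

```python
def to_public(data):
    """Flatten the internal dict-of-dicts to sorted (sample, trait, value) rows."""
    return sorted((s, t, v) for s, traits in data.items() for t, v in traits.items())

def from_public(rows):
    """Invert to_public: rebuild the internal dict-of-dicts from flat rows."""
    data = {}
    for s, t, v in rows:
        data.setdefault(s, {})[t] = v
    return data

# Round trip: converting forward and back should preserve the data exactly.
internal = {"F2_001": {"weight": 32.1, "glucose": 148.0},
            "F2_002": {"weight": 28.7, "glucose": 131.5}}
assert from_public(to_public(internal)) == internal
```

A round trip like this would have caught the conversion bugs before the data were posted, rather than after.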

And then document, document, document! As with software development, it’s hard to document data or analyses after the fact.


Code review

25 Sep 2013

There was an interesting news item in Nature on code review. It describes a project by some folks at Mozilla to review the code (well, really just 200-line snippets) from 6 selected papers in computational biology.

There are very brief quotes from Titus Brown and Roger Peng. I expect that the author of the item, Erika Check Hayden, spoke to Titus and Roger at length but could include just short bits from each, and so what they say probably doesn’t fully (or much at all) characterize their views of the issue.

Titus is quoted as follows, in reference to another scientist who retracted five papers due to an error in his code:

“That’s the kind of thing that should freak any scientist out…. We don’t have good processes in place to detect that kind of thing in software.”

Roger is quoted at the end, as follows:

“One worry I have is that, with reviews like this, scientists will be even more discouraged from publishing their code…. We need to get more code out there, not improve how it looks.”

I agree with both of them, but my initial reaction, from the beginning of the piece, was closer to Roger’s: We often have a heck of a time getting any code out of people; if we are too hard on people regarding the quality of their code, they might become even less willing to share.

On the one hand, we want people to produce good code:

  • that works
  • that’s readable
  • that’s reusable

And it would be great if, for every bit of code, there was a second programmer who studied it, verified that it was doing the right thing, and offered suggestions for improvement.

But, on the other hand, it seems really unlikely that journals have the resources to do this. And I worry that a study showing that much scientific software is crap will make people even less willing to share.

I would like to see the code associated with scientific articles made readily available, during the review process and beyond. But I don’t think we (as a scientific community) want to enforce rigorous code review prior to publication.

Later, on twitter, Titus took issue with the “not improve how it looks” part of what Roger said:

“.@kwbroman @simplystats @rdpeng Please read you are deeply, significantly, and completely wrong about code review.”

Characterizing code review as strictly cosmetic was an unfortunate, gross simplification. (And how code looks is important.)

I don’t have enough time this morning to really clean up my thoughts on this issue, and I want to get this out and move on to reading that dissertation that I have to get through by tomorrow. So, let me summarize.


We want scientific code to be well written: does what it’s intended to do, readable, reusable.

We want scientific code to be available. (Otherwise we can’t verify that it does what it’s intended to do, or reuse it.)

If we’re too hard on people for writing substandard code, we’ll discourage the availability. It’s an important trade-off.

“My” chromosome 8p inversion

8 May 2013

There was lots of discussion on twitter yesterday about Graham Coop’s paper with Peter Ralph (or vice versa), on The geography of recent genetic ancestry across Europe, particularly regarding the FAQ they’d created.

I was eager to take a look, and, it’s slightly embarrassing to say, I first did a search to see if they’d made a connection to any of my work. (I’m probably not the only one to do that.) Sure enough, they cited a paper of mine, but it was Giglio et al. (2001) Am J Hum Genet 68: 874–883, on “my” chr 8p inversion, and not what I’d expected, my autozygosity paper.

What did the chr 8p inversion have to do with this? Search for “[36]” and you’ll find:

We find that the local density of IBD blocks of all lengths is relatively constant across the genome, but in certain regions the length distribution is systematically perturbed (see Figure S1), including around certain centromeres and the large inversion on chromosome 8 [36], also seen by [35].

The chr 8p inversion presents an interesting data analysis story from my postdoc years. In a nutshell: I was studying human crossover interference, found poor model fit for maternal chr 8 that was due to tight apparent triple-crossovers in two individuals in each of two families, hypothesized that there was an inversion in the region, but it would have to be both long and with both orientations being common. The inversion was confirmed via FISH, and it’s something like 5 Mbp long, with the frequencies of the two orientations being 40 and 60% in people of European ancestry.


$18 for a two page PDF? I still don’t get it.

2 May 2013

Yesterday, I saw this tweet by @Ananyo

Time that biologists stopped telling the public oversimplistic fairy tales on Darwinian evolution, says P Ball ($)…

So I clicked the link to the Nature paper and realized, “Oh, yeah. I’ve got to enter through the UW library website.”

But then I thought, “Wait…$18 for a two-page Nature comment? WTF?”

So I tweeted:

DNA: Celebrate the unknowns, like this Nature comment, which costs $18.…

And thinking about it some more, I got more annoyed, and tweeted:

Why do publishers charge such high per-article fees? At $18/artcl, you’d have to be desperate or stupid to pay; at $1-2, prob’ly lots would.

And then I thought, I’ll ask Nature directly:

@NatureMagazine Why is the per-article charge so high? It seems like you’d make more profit at $2/article.

And they responded:

@kwbroman For a while now, individual papers can be rented through @readcube for $3-5. A full tablet subscription to Nature costs $35.

But that didn’t quite answer my question. So I asked:

.@NatureMagazine So is the $18 charge for a 2 pg PDF just to discourage piracy?

I thought a lot about whether to put “piracy” in quotes or not, or whether to write “copyright infringement” instead.

But anyway, they responded:

@kwbroman just as with any product, the more you buy, the more you save. Media/publishing subscriptions have worked this way for decades.

That again didn’t quite answer my question.

It’s a scam

I still don’t understand the $18 business. It’s not “The more you buy, the more you save.” It’s, “Buy the whole season for $35, or buy 5 min from Episode 1 for $18.”

I understand that the cover price of Wired is $5 per issue, while I could get a year’s subscription for $15-20. But that’s not the same as $18 for one article vs $200 per year.

The $18 for a two-page PDF is like 900 numbers and paycheck advances. These are scams taking advantage of desperate or stupid people.

If they don’t want to sell the PDFs for individual articles for a reasonable price, they should just not sell them at all.

Methods before results

29 Apr 2013

It’s great that, in a step towards improved reproducibility, the Nature journals are removing page limits on Methods sections:

To allow authors to describe their experimental designs and methods in enough detail for others to interpret and replicate them, the participating journals are removing length restrictions on Methods sections.

But couldn’t they include the Methods section in the PDF for the article? For example, consider this article in Nature Genetics; the Methods section is only available in the HTML version of the paper. The PDF says:

Methods and any associated references are available in the online version of the paper.

Methods are important.

  • They shouldn’t be separated from the main text.
  • They shouldn’t be placed after the results (as so many journals, including PLoS, do).
  • They shouldn’t be in a smaller font than the main text (as PNAS does).
  • They certainly shouldn’t be endnotes (as Science used to do).

Supplements annoy me too

I love supplemental material: authors can give the full details, and they can provide as many supplemental figures and tables as they want.

But supplements can be a real pain.

  • I don’t want to have to click on 10 different links. Put it all in one document.
  • I don’t want to have to open Word. Put text and figures in a PDF.
  • I don’t want to have to open Excel. Put data in a plain text file, preferably as part of a git repository with related code.

At least supplements are now included at the journal sites!

This paper in Bioinformatics refers to a separate site for supplemental information:

Expression data and supplementary information are available at

But that site doesn’t exist anymore. I was able to find the supplement using the Wayback Machine, but

  • The link in the paper was wrong: It should be .html not .htm
  • The final version on Wayback has a corrupted PDF, though one can go back to previous versions that are okay.

I like Genetics and G3

Genetics and G3 put the Methods where they belong (before the results), and when you download the PDF for an article in Genetics, it includes the supplement. For a G3 article, the supplement isn’t included in the article PDF, but at least you can get the whole supplement as a single PDF.

For example, consider my recent Genetics articles:

If you click on “Full Text (PDF),” you get the article plus the 3 supplemental figures and 23 supplemental tables in the former case, and article plus the 17 supplemental figures and 2 supplemental tables in the latter case.

Towards making my own papers reproducible

10 Mar 2013

Much has been written about reproducible research: that scientific papers should be accompanied by the data and software sufficient to reproduce the results. It’s obviously a Good Thing. But it can be hard to stick to this ideal in practice.

For my early papers, I’m not sure I can find the materials anymore, and that’s just 15 years back.

For my recent papers, I have developed a sort of system so that I can reproduce the results myself. I use a combination of tools, including R, Sweave, Perl, and, of course, make.

But I’ve not been distributing the detailed code behind my papers. It’s not always pretty, and it is not well documented. (And “not always” is a bit of a stretch — more like “seldom.”)

When Victoria Stodden visited Madison last fall, I was inspired to release this code, but I never carried through on that.

But then last night I tweeted an example graph from a paper, was (appropriately) asked to provide code, and so created a GitHub repository with the bulk of the code for that paper.

The repository is incomplete: it doesn’t include the code to do the main analysis and simulations, but just to make the figures, starting from those results. I’ll work to add those additional details.

And, even once complete, it will be far from perfect. The code is (or will be) there, but it would take a bit of work for an interested reader to figure out what it is doing, since much of it is undocumented and poorly written.

But if we ask for perfection, we’ll get nothing. If we ask for the minimal materials for reproducibility, we may get it.

So that’s my goal: to focus first on minimal accessibility of the code and data behind a paper, even if it is minimally readable and so might take quite a bit of effort for someone else to follow.

One last point: I use local git repositories for my draft analyses and for the whole process of writing a paper. I could post that whole history, but as I said before:

Open source means everyone can see my stupid mistakes. Version control means everyone can see every stupid mistake I’ve ever made.

It would be easy to make my working repository public, but it would include things like referees’ reports and my responses to them, as well as the gory details on all of the stupid things that I might do along the way to publication.

I’m more comfortable releasing just a snapshot of the final product.

Wired on statistics

23 Oct 2012

Wired magazine had a short article last month on how to read a scientific report, which basically was about how to interpret the statistical results. It’s reasonably well done.

Odd smoother

27 Sep 2012

I was flipping though the contents of G3 and saw this paper, which includes Jim Cheverud as an author.

I poked through the paper, and my eye was attracted to the following figure. These are supposed to be functions, but some weird smoother was used, and the results are not functions.


17 May 2012

I’m frequently picking pieces of paper up off the floor — bits of the kids’ artwork. Today I found one with this:

That’s actually mine, so I thought I’d share it. Here’s a link to the full page.

These scratchings are related to a bit of theory I’ve been working on (with considerable help), regarding phylogenetic trees. It’s part of this paper, which I’m getting ready to re-submit.

Collaborative Cross papers

16 Feb 2012

A batch of 15 papers on the mouse Collaborative Cross appeared in Genetics and G3 today, including two of my own: