Posts Tagged ‘reproducible research’

Reproducibility is hard

9 Sep 2015

Reproducibility is hard. It will probably always be hard, because it’s hard keeping things organized.

I recently had a paper accepted at G3, concerning a huge set of sample mix-ups in a large eQTL study. I’d discovered and worked out the issue back in December, 2010. I gave a talk about it at the Mouse Genetics meeting in Washington, DC, in June, 2011. But for reasons that I will leave unexplained, I didn’t write it up until much later. I did the bulk of the writing in October, 2012, but it wasn’t until February, 2014, that I posted a preprint at arXiv, which I then finally submitted to G3 in June this year.

In writing up the paper in late 2012, I re-did the entire analysis from scratch, to make the whole thing more cleanly reproducible. So with the paper now in press, I’ve placed all of that in a GitHub repository, but, as it turned out, there was still a lot more to do. (You can tell from the repository that this is an old project, because there are a couple of Perl scripts in there. It’s been a long time since I switched from Perl to Python and Ruby. I still can’t commit to just one of the two: I want to stick with Python, as everyone else is using it, but I much prefer Ruby.)

The basic issue is that the raw data are about 1 GB. The clean version of the data is another 1 GB. And then there are the results of various intermediate calculations, some of them rather slow to compute, which take up another 100 MB. I can’t reasonably put all of that within the GitHub repository.

Both the raw and clean data have been posted in the Mouse Phenome Database. (Thanks to Petr Simecek, Gary Churchill, Molly Bogue, and Elissa Chesler for that!) But the data there are in a form that I thought would be suitable for others, and not quite in the form in which I’d actually been using them.

So, I needed to write a script that would grab the data files from MPD and reorganize them in the way that I’d been using them.
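(Here’s a minimal sketch of the sort of script I mean; the URL, file names, and column name are placeholders for illustration, not the ones actually used in the repository.)

# sketch of a download-and-reorganize script
# (URL, file names, and column name below are placeholders)
url <- "https://phenome.jax.org/path/to/expression_data.csv"
local_file <- "expression_data.csv"

# grab the posted file, but only if we don't already have it
if(!file.exists(local_file))
    download.file(url, local_file)

# read the posted version and rearrange it into the layout
# that the analysis scripts expect
dat <- read.csv(local_file, stringsAsFactors = FALSE)
dat <- dat[order(dat$MouseNum), ]    # sort by a hypothetical sample ID column
write.csv(dat, "expression_data_reorg.csv", row.names = FALSE)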

In working on that, I discovered some mistakes in the data posted to MPD: there were a couple of bugs in my code to convert the data from the format I was using into the format I was going to post. (So it was good to spend the time on the script that did the reverse!)

In addition to the raw and clean data on MPD, I posted a zip file with the 110 MB of intermediate results on figshare.

In the end, I’m hoping that you can clone the GitHub repository and just run make, and it will download the data and do all of the rest. If you want to save some time, you can download the zip file from figshare, unzip it, and then run make.

I’m not quite there, but I think I’m close.

Aspects I’m happy with

For the most part, my work on this project wasn’t terrible.

  • I wrote an R package, R/lineup, with the main analysis methods.
  • I re-derived the entire analysis cleanly, in a separate, reproducible document (I used AsciiDoc and knitr), which was a good thing.
  • The code for the figures and tables is all reasonably clean, and draws from either the original data files or from the intermediate calculations produced by the AsciiDoc document.
  • I automated everything with GNU Make.

What should I have done differently?

There was a lot of after-the-fact work that I would rather not have had to do.

Making a project reproducible is easier if the data aren’t that large and so can be bundled into the GitHub repository with all of the code.

With a larger data set, I guess the thing to do is to recognize, from the start, that the data are going to be sitting elsewhere. In that case, I think you should organize the data in the form that you expect to make public, and work from those files.

When you write a script to convert data from one form to another, also write some tests, to make sure that it worked correctly.
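For example, here’s a minimal sketch of the kind of round-trip check I have in mind, with hypothetical functions convert_to_posted() and convert_from_posted() for converting between my working format and the posted format:

# round-trip test of a data conversion
# (convert_to_posted and convert_from_posted are hypothetical functions,
#  and the file name is just for illustration)
library(testthat)

orig <- read.csv("rawdata_working.csv", stringsAsFactors = FALSE)
posted <- convert_to_posted(orig)
back <- convert_from_posted(posted)

# converting to the posted format and back should be a no-op
test_that("conversion round-trip preserves the data", {
    expect_equal(back, orig)
})

A check like this won’t catch everything (a bug shared by both conversion directions would slip through), but it’s cheap insurance.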

And then document, document, document! As with software development, it’s hard to document data or analyses after the fact.

Initial steps towards reproducible research

4 Dec 2014

In anticipation of next week’s Reproducible Science Hackathon at NESCent, I was thinking about Christie Bahlai’s post on “Baby steps for the open-curious.”

Moving from Ye Olde Standard Computational Science Practice to a fully reproducible workflow seems a monumental task, but partially reproducible is better than not-at-all reproducible, and it’d be good to give people some advice on how to get started – to encourage them to get started.

So, I spent some time today writing another of my minimal tutorials, on initial steps towards reproducible research.

It’s a bit rough, and it could really use some examples, but it helped me to get my thoughts together for the Hackathon and hopefully will be useful to people (and something to build upon).

Yet another R package primer

28 Aug 2014

Hadley Wickham is writing what will surely be a great book about the basics of R packages. And Hilary Parker wrote a very influential post on how to write an R package. So it seems like that topic is well covered.

Nevertheless, I’d been thinking for some time that I should write another minimal tutorial with an alliterative name, on how to turn R code into a package. And it does seem valuable to have a diversity of resources on such an important topic. (R packages are the best way to distribute R code, or just to keep track of your own personal R code, as part of a reproducible research process.)

So I’m going ahead with it, even though it doesn’t seem necessary: the R package primer.

It’s not completely done, but the basic stuff is there.
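(As a taste of what the primer covers, here’s a minimal sketch of turning some R code into a package with devtools, as it worked around the time of this post; the package name mypkg is made up.)

# minimal package-creation sketch with devtools
# ("mypkg" is a made-up name, just for illustration)
library(devtools)

create("mypkg")      # set up a skeleton: DESCRIPTION file, R/ directory, etc.
# put your R code in mypkg/R/, with roxygen2 comments above each function
document("mypkg")    # generate the man/ pages and NAMESPACE from those comments
install("mypkg")     # install the package locally, so that library(mypkg) works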

knitr in a knutshell tutorial

6 Feb 2014

I spent a lot of time this week writing a short tutorial on knitr: knitr in a knutshell.

This is my third little tutorial. (The previous ones were a git/github guide and a minimal make tutorial.)

I’m pleased with these tutorials. In learning new computing skills, it can be hard to get started. My goal was to provide the initial motivation and orientation, and then links to other resources. I think they are effective in that regard.

I’ve gotten really excited about the tools for reproducible research. I think the main reason that statisticians have been so slow to adopt a reproducible workflow is a lack of training.

I’m hoping that these tutorials, plus the other materials that I’m putting together over the course of this semester, for my Tools for Reproducible Research course, will help.

But take a look at the material that Jeff, Roger, and Brian are developing for their Data Science MOOCs; you’ll see that mine are pretty humble contributions.

Code review

25 Sep 2013

There was an interesting news item in Nature on code review. It describes a project by some folks at Mozilla to review the code (well, really just 200-line snippets) from 6 selected papers in computational biology.

There are very brief quotes from Titus Brown and Roger Peng. I expect that the author of the item, Erika Check Hayden, spoke to Titus and Roger at length but could include only short bits from each, and so what they say probably doesn’t fully (or much at all) characterize their views of the issue.

Titus is quoted as follows, in reference to another scientist who retracted five papers due to an error in his code:

“That’s the kind of thing that should freak any scientist out…. We don’t have good processes in place to detect that kind of thing in software.”

Roger is quoted at the end, as follows:

“One worry I have is that, with reviews like this, scientists will be even more discouraged from publishing their code…. We need to get more code out there, not improve how it looks.”

I agree with both of them, but my initial reaction, from the beginning of the piece, was closer to Roger’s: We often have a heck of a time getting any code out of people; if we are too hard on people regarding the quality of their code, they might become even less willing to share.

On the one hand, we want people to produce good code:

  • that works
  • that’s readable
  • that’s reusable

And it would be great if, for every bit of code, there was a second programmer who studied it, verified that it was doing the right thing, and offered suggestions for improvement.

But, on the other hand, it seems really unlikely that journals have the resources to do this. And I worry that a study showing that much scientific software is crap will make people even less willing to share.

I would like to see the code associated with scientific articles made readily available, during the review process and beyond. But I don’t think we (as a scientific community) want to enforce rigorous code review prior to publication.

Later, on twitter, Titus took issue with the “not improve how it looks” part of what Roger said:

“.@kwbroman @simplystats @rdpeng Please read http://en.wikipedia.org/wiki/Code_review you are deeply, significantly, and completely wrong about code review.”

Characterizing code review as strictly cosmetic was an unfortunate, gross simplification. (And how code looks is important.)

I don’t have enough time this morning to really clean up my thoughts on this issue, and I want to get this out and move on to reading that dissertation that I have to get through by tomorrow. So, let me summarize.

Summary

We want scientific code to be well written: it should do what it’s intended to do, and it should be readable and reusable.

We want scientific code to be available. (Otherwise we can’t verify that it does what it’s intended to do, or reuse it.)

If we’re too hard on people for writing substandard code, we’ll discourage the availability. It’s an important trade-off.

Electronic lab notebook

20 Aug 2013

I was interested to read C. Titus Brown’s recent post, “Is version control an electronic lab notebook?”

I think version control is really important, and I think all computational scientists should have something equivalent to a lab notebook. But I think of version control as serving needs orthogonal to those served by a lab notebook.

As Titus points out, a traditional lab notebook serves two purposes: provenance and protocol. Version control could be useful for provenance, but I don’t really care about provenance. And for protocol, version control doesn’t really matter.

Version control

I really like git with github. (See my tutorial.) But for me, the basic need served by version control is that embodied in the question, “This shit worked before; why isn’t it working now?”

You don’t want to edit working code in place and so possibly break a working system. Version control lets you try things out, and it lets you go back to any version of your software, from any point in time.

The other basic use of version control is for managing projects with multiple contributors. If there are multiple programmers working on a software project, or multiple authors working on a manuscript, version control is the best way to manage things, particularly for merging everyone’s efforts.

These are really useful things, but version control is more about merging and history, and not so much about reproducible research.

Make is the thing

To me, the basic tool to make research reproducible is GNU make (see my minimal tutorial). You create a Makefile that documents all analysis steps in a project. (For example, “Use this script to turn these raw data files into that combined file, and use this script to create figure 1 and that script to create figure 2, then combine them with this LaTeX file to make the manuscript PDF.”)

With GNU make (see also rake), you both document and automate these processes. With well-documented/commented scripts and an all-encompassing Makefile, the research is reproducible.

Add knitr, and you’ve got a notebook

The other ingredient to create the computational scientist’s equivalent of a lab notebook is knitr, which allows one to combine text (e.g., in Markdown or asciidoc) and code (e.g., in R) to make documents (e.g., in html or PDF) that both do the work and explain the work. Write such documents to describe what you did and what you learned, and you’ve got an electronic lab notebook.
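As a small, made-up example, an R Markdown file along the following lines (say, notebook.Rmd) can be turned into an HTML report with something like knitr::knit2html("notebook.Rmd"); the report interleaves the text with the code, its output, and the figure.

Looking for outlier samples before re-running the genome scan.

```{r sample_summaries}
x <- rnorm(100)    # placeholder for the real per-sample summaries
hist(x, breaks = 30, main = "Per-sample summaries")
```

No obvious outliers; proceed with the full analysis.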

You could even get rid of your Makefile by having an over-arching knitr-based document that does it all. But I still like make.

But it’s so much work!

Going into a file and deleting a data point is a lot easier than writing a script that does it (and also documents why). But I don’t think you should be going in and changing the data like that, even if it is being tracked by version control. (And that is the main complaint potential users have about version control: “Too time consuming!”)

I think you have to expect that writing well-documented scripts and knitr-based reports that capture the totality of a data analysis project will take a lot of work: perhaps double (or more!) the effort. But it will save a ton of time later (if others care about what you did).

I don’t really want to take this time in the midst of a bout of exploratory data analysis. I find it too inhibiting. So I tend to do a bunch of analyses, capturing the main ideas in a draft R script (or reconstructing them later from the .Rhistory file), and then go back later to make a clean knitr-based document that explains what I was doing and why.

It can be hard to force myself to do the clean-up. I wish there were an easier way. But I expect that well-organized lab scientists devote a lot of time to constructing their lab notebooks, too.

Tutorials on git/github and GNU make

10 May 2013

If you’re not using version control, you should be. Learn git.

If you’re not on github, you should be. That’s real open source.

To help some colleagues get started with git and github, I wrote a minimal tutorial. There are lots of git and github resources available, but I thought I’d give just the bare minimum to get started; after using git and github for a while, other resources make a lot more sense and seem much more worthwhile.

And for R folks, note that it’s easy to install R packages that are hosted on github, using Hadley Wickham’s devtools package. For example, to install Nacho Caballero’s clickme package:

install.packages("devtools")
library(devtools)
install_github("clickme", "nachocab")

Having written that git/github tutorial, I thought: I should write more of these!

So I immediately wrote a similar short tutorial on GNU make, which I think is the most important tool for reproducible research.

Methods before results

29 Apr 2013

It’s great that, in a step towards improved reproducibility, the Nature journals are removing page limits on Methods sections:

To allow authors to describe their experimental designs and methods in enough detail for others to interpret and replicate them, the participating journals are removing length restrictions on Methods sections.

But couldn’t they include the Methods section in the PDF for the article? For example, consider this article in Nature Genetics; the Methods section is only available in the html version of the paper. The PDF says:

Methods and any associated references are available in the online version of the paper.

Methods are important.

  • They shouldn’t be separated from the main text.
  • They shouldn’t be placed after the results (as so many journals, including PLoS, do).
  • They shouldn’t be in a smaller font than the main text (as PNAS does).
  • They certainly shouldn’t be endnotes (as Science used to do).

Supplements annoy me too

I love supplemental material: authors can give the full details, and they can provide as many supplemental figures and tables as they want.

But supplements can be a real pain.

  • I don’t want to have to click on 10 different links. Put it all in one document.
  • I don’t want to have to open Word. Put text and figures in a PDF.
  • I don’t want to have to open Excel. Put data in a plain text file, preferably as part of a git repository with related code.

At least supplements are now included at the journal sites!

This paper in Bioinformatics refers to a separate site for supplemental information:

Expression data and supplementary information are available at
http://www.rii.com/publications/2003/HE_SDS.htm.

But rii.com doesn’t exist anymore. I was able to find the supplement using the Wayback Machine, but

  • The link in the paper was wrong: it should be .html, not .htm.
  • The final version on Wayback has a corrupted PDF, though one can go back to previous versions that are okay.

I like Genetics and G3

Genetics and G3 put the Methods where they belong (before the results), and when you download the PDF for an article in Genetics, it includes the supplement. For a G3 article, the supplement isn’t included in the article PDF, but at least you can get the whole supplement as a single PDF.

For example, consider two of my recent Genetics articles: if you click on “Full Text (PDF),” you get the article plus the 3 supplemental figures and 23 supplemental tables in one case, and the article plus the 17 supplemental figures and 2 supplemental tables in the other.

Towards making my own papers reproducible

10 Mar 2013

Much has been written about reproducible research: that scientific papers should be accompanied by the data and software sufficient to reproduce the results. It’s obviously a Good Thing. But it can be hard to stick to this ideal in practice.

For my early papers, I’m not sure I can find the materials anymore, and that’s just 15 years back.

For my recent papers, I have developed a sort of system so that I can reproduce the results myself. I use a combination of tools, including R, Sweave, Perl, and, of course, make.

But I’ve not been distributing the detailed code behind my papers. It’s not always pretty, and it is not well documented. (And “not always” is a bit of a stretch — more like “seldom.”)

When Victoria Stodden visited Madison last fall, I was inspired to release this code, but I never carried through on that.

But then last night I twittered an example graph from a paper, was (appropriately) asked to produce code, and so created a github repository with the bulk of the code for that paper.

The repository is incomplete: it doesn’t include the code for the main analysis and simulations, just the code to make the figures, starting from those results. I’ll work to add those additional details.

And, even once complete, it will be far from perfect. The code is (or will be) there, but it would take a bit of work for an interested reader to figure out what it is doing, since much of it is undocumented and poorly written.

But if we ask for perfection, we’ll get nothing. If we ask for the minimal materials for reproducibility, we may get it.

So that’s my goal: to focus first on minimal accessibility of the code and data behind a paper, even if it is minimally readable and so might take quite a bit of effort for someone else to follow.

One last point: I use local git repositories for my draft analyses and for the whole process of writing a paper. I could post that whole history, but as I said before:

Open source means everyone can see my stupid mistakes. Version control means everyone can see every stupid mistake I’ve ever made.

It would be easy to make my working repository public, but it would include things like referees’ reports and my responses to them, as well as the gory details on all of the stupid things that I might do along the way to publication.

I’m more comfortable releasing just a snapshot of the final product.

A course in statistical programming

25 May 2012

Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but such courses are often focused on methods (like numerical linear algebra, the EM algorithm, and MCMC) and not on actual coding.

For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.

Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.

Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.

One really learns to code in private, struggling over problems, but I benefited enormously from a statistical computing course I took from Phil Spector at Berkeley.

Brian Caffo, Ingo Ruczinski, Roger Peng, Rafael Irizarry, and I developed a statistical programming course at Hopkins that (I think) really did the job.

I would like to develop a similar such course at Wisconsin: on statistical programming, in the most general sense.

I have in mind several basic principles:

  • be self-sufficient
  • get the right answer
  • document what you did (so that you will understand what you did 6 months later)
  • if primary data change, be able to re-run the analysis without a lot of work
  • make your simulation results reproducible
  • reuse code (others’ and your own) rather than starting from scratch every time
  • make methods accessible to (and used by) others

Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.

  • Basic unix tools (find; df; top; ps ux; grep); unix on Mac and Windows
  • Emacs/vim/other editors (RStudio/Eclipse)
  • LaTeX (for papers; for presentations)
  • slides for talks; posters; figures/tables
  • Advanced R (fancy data structures; functions; object-oriented stuff)
  • Advanced R graphics
  • R packages
  • Sweave/asciidoc/knitr
  • minimal Perl (or Python or Ruby); example of data manipulation
  • minimal C (or C++); examples of speed-up
  • version control (e.g., git or mercurial); backups
  • reproducible research ideas
  • data management
  • managing projects: data, analyses, results, papers
  • programming style (readable, modular); general but not too general
  • debugging/profiling/testing
  • high-throughput computing; parallel computing; managing big jobs
  • finding answers to questions: man pages; documentation; web
  • more on visualization; dynamic graphics
  • making a web page; html & css; simple cgi-type web forms?
  • writing and managing email
  • managing references to journal articles