Towards making my own papers reproducible

Much has been written about reproducible research: that scientific papers should be accompanied by the data and software sufficient to reproduce the results. It’s obviously a Good Thing. But it can be hard to stick to this ideal in practice.

For my early papers, I’m not sure I can find the materials anymore, and that’s just 15 years back.

For my recent papers, I have developed a sort of system so that I can reproduce the results myself. I use a combination of tools, including R, Sweave, Perl, and, of course, make.

But I’ve not been distributing the detailed code behind my papers. It’s not always pretty, and it is not well documented. (And “not always” is a bit of a stretch — more like “seldom.”)

When Victoria Stodden visited Madison last fall, I was inspired to release this code, but I never carried through on that.

But then last night I tweeted an example graph from a paper, was (appropriately) asked to produce the code, and so created a GitHub repository with the bulk of the code for that paper.

The repository is incomplete: it doesn’t include the code for the main analysis and simulations, just the code to make the figures, starting from those results. I’ll work to add those additional details.

And, even once complete, it will be far from perfect. The code is (or will be) there, but it would take a bit of work for an interested reader to figure out what it is doing, since much of it is undocumented and poorly written.

But if we ask for perfection, we’ll get nothing. If we ask for the minimal materials for reproducibility, we may get them.

So that’s my goal: to focus first on minimal accessibility of the code and data behind a paper, even if it is minimally readable and so might take quite a bit of effort for someone else to follow.

One last point: I use local git repositories for my draft analyses and for the whole process of writing a paper. I could post that whole history, but as I said before:

Open source means everyone can see my stupid mistakes. Version control means everyone can see every stupid mistake I’ve ever made.

It would be easy to make my working repository public, but it would include things like referees’ reports and my responses to them, as well as the gory details on all of the stupid things that I might do along the way to publication.

I’m more comfortable releasing just a snapshot of the final product.



9 Responses to “Towards making my own papers reproducible”

  1. Leonardo Collado T (@fellgernon) Says:

    While it would certainly take time to completely understand the code, I have to say that thanks to your coding style (spaces, variable names) it looks like it wouldn’t be too hard to do. So don’t be too hard on yourself when you say that the code is poorly written.
    As a plus, just explaining the objects in the Rdata files would help quite a bit. If they are data frames, explaining what the columns are would be nice.
    This is one way of doing it. Write an Rmd file where you load the data and do head() on each object. Then briefly explain what each column is (or element if a list, or its values if a vector, etc). If you call it README.Rmd then it would be shown on GitHub automatically =) I would save it in your R directory:

  2. hilaryparker Says:

    I’m with you on the “stupid mistakes along the way.” I’m happy putting stuff up on github once it’s finished (even after the first submission — don’t mind people tracking my bug fixes or minor revisions). But having it on github from day 1 is asking quite a different thing than having it on github once it’s mostly finished.

    I think ideally we all would just judge each other less for the messy process of statistical discovery. But then again, that’s just a pipe dream.

    • Karl Broman Says:

      The github repository for the talk I gave at Kansas State has the whole history. I’m thinking I should scrap it and just make it a snapshot of the final talk. An advantage of bitbucket over github is that you don’t have to pay to have private repositories (for academics, anyway).

  3. Ken Butler Says:

    It strikes me that only putting up the final version is no different from going to the theatre and seeing only the rehearsed version of a play. The process of how the play came to become what you see is kind of interesting, but it’s not the reason you go to the theatre.

