Archive for September, 2015

Session info from R/Travis

25 Sep 2015

For the problem I reported yesterday, in which my R package was working fine locally but failing on Travis, the key solution is to run update.packages(ask=FALSE) locally, and maybe even update.packages(ask=FALSE, type="source") to be sure to grab the source of packages for which binaries are not yet available. I now know to do that.

In addition, it’d be useful to have session information (R and package versions) in the results from Travis. This has proven a bit tricky.

If you don’t want to go with a fully custom Travis script, your customization options are limited. We really only care about the case of a failure, so after_success is not of interest, and after_script seems not to be run if the Travis build fails. Moreover, script and after_failure are defined by the main script for language: r, so you can’t change them without going all-custom.

What’s left is before_script.

I want to see the result of devtools::session_info() with the package of interest loaded, but the package actually gets built after before_script is run, so we’ll need to build and install it, even though it’ll be built and installed again afterwards. The best I could work out is in this example .travis.yml file, with the key bits being:

before_script:
  - export PKG_NAME=$(Rscript -e 'cat(paste0(devtools::as.package(".")$package))')
  - export PKG_TARBALL=$(Rscript -e 'pkg <- devtools::as.package("."); cat(paste0(pkg$package,"_",pkg$version,".tar.gz"))')
  - R CMD build --no-build-vignettes .
  - R CMD INSTALL ${PKG_TARBALL}
  - rm ${PKG_TARBALL}
  - echo "Session info:"
  - Rscript -e "library(${PKG_NAME});devtools::session_info('${PKG_NAME}')"

I use --no-build-vignettes in R CMD build as otherwise the package would be built and installed yet another time. And I remove the .tar.gz file afterwards, to avoid having the later check complain about the extra file.
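One possible simplification, as a minimal sketch rather than a tested Travis recipe: the package name and version could be read straight from the DESCRIPTION file with awk, which would avoid two of the Rscript calls. The helper names desc_field and pkg_tarball are hypothetical, and the sketch assumes that the Package and Version fields each sit on a single "Field: value" line.

```shell
# Hypothetical helper: print the value of one field from a DESCRIPTION file.
# Assumes the field fits on a single "Field: value" line.
#   $1 = field name, $2 = path to DESCRIPTION (defaults to ./DESCRIPTION)
desc_field() {
  awk -F': *' -v field="$1" '$1 == field { print $2; exit }' "${2:-DESCRIPTION}"
}

# Hypothetical helper: print the file name that R CMD build produces,
# i.e. <Package>_<Version>.tar.gz.
#   $1 = path to DESCRIPTION (defaults to ./DESCRIPTION)
pkg_tarball() {
  printf '%s_%s.tar.gz\n' "$(desc_field Package "$1")" "$(desc_field Version "$1")"
}
```

With these defined, PKG_NAME=$(desc_field Package) and PKG_TARBALL=$(pkg_tarball) should give the same values as the devtools::as.package() calls, provided DESCRIPTION is in the current directory.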

Here’s an example of the session info in the Travis log.

If you have suggestions about how to simplify this, I’d be happy to hear them. I guess the key would be to have the main Travis script for R revised to report session information.

Thanks to Jenny Bryan for showing me how to search for instances of session_info in .travis.yml files on GitHub, and to Carson Sievert for further moral support.

It’s not you, it’s me

24 Sep 2015

Somehow when my code stops working, my first (and second, and third) reaction is to blame everything except my own code. (“It’s not me, it’s you.”)

And almost always, it’s my own code that’s the problem (hence the title of this post).

I spent the day trying to resolve a bug in my early-in-development R package, qtl2geno. In the process, I blamed

  • TravisCI for not handling system.file() properly.
  • R-devel for having broken system.file().
  • data.table::fread() for treating sep=NULL differently on different operating systems.

Of course, none of these were true. I was just passing sep=NULL to data.table::fread(). That worked in the previous version but doesn’t work in the latest release on CRAN. I hadn’t yet installed the latest version of data.table on my Mac, whereas Travis and my junky Windows laptop already had it.

The debugging process seems a potentially interesting case study, so I thought I’d write down some of the details.


Reproducibility is hard

9 Sep 2015

Reproducibility is hard. It will probably always be hard, because it’s hard keeping things organized.

I recently had a paper accepted at G3, concerning a huge set of sample mix-ups in a large eQTL study. I’d discovered and worked out the issue back in December, 2010. I gave a talk about it at the Mouse Genetics meeting in Washington, DC, in June, 2011. But for reasons that I will leave unexplained, I didn’t write it up until much later. I did the bulk of the writing in October, 2012, but it wasn’t until February, 2014, that I posted a preprint at arXiv, which I then finally submitted to G3 in June this year.

In writing up the paper in late 2012, I re-did my entire analysis from scratch, to make the whole thing more cleanly reproducible. So with the paper now in press, I’ve placed all of that in a GitHub repository, but as it turned out, there was still a lot more to do. (You can tell, from the repository, that this is an old project, because there are a couple of Perl scripts in there. It’s been a long time since I switched from Perl to Python and Ruby. I still can’t commit to just one of Python or Ruby…want to stick with Python, as everyone else is using it, but much prefer Ruby.)

The basic issue is that the raw data are about 1 GB. The clean version of the data is another 1 GB. And then there are the results of various intermediate calculations, some of which are rather slow to compute; these take up another 100 MB. I can’t reasonably put all of that within the GitHub repository.

Both the raw and clean data have been posted in the Mouse Phenome Database. (Thanks to Petr Simecek, Gary Churchill, Molly Bogue, and Elissa Chesler for that!) But the data are in a form that I thought suitable for others, and not quite in the form in which I actually used them.

So, I needed to write a script that would grab the data files from MPD and reorganize them in the way that I’d been using them.

In working on that, I discovered some mistakes in the data posted to MPD: there were a couple of bugs in my code to convert the data from the format I was using into the format I was going to post. (So it was good to spend the time on the script that did the reverse!)

In addition to the raw and clean data on MPD, I posted a zip file with the 110 MB of intermediate results on figshare.

In the end, I’m hoping that one can clone the GitHub repository and just run make and it will download the data and do all of the stuff. If you want to save some time, you could download the zip file from figshare and unzip that, and then run make.

I’m not quite there, but I think I’m close.

Aspects I’m happy with

For the most part, my work on this project wasn’t terrible.

  • I wrote an R package, R/lineup, with the main analysis methods.
  • I re-derived the entire analysis cleanly, in a separate, reproducible document (using AsciiDoc and knitr).
  • The code for the figures and tables is reasonably clean, and draws from either the original data files or from intermediate calculations produced by the AsciiDoc document.
  • I automated everything with GNU Make.

What should I have done differently?

There was a lot more after-the-fact work than I would have liked.

Making a project reproducible is easier if the data aren’t that large and so can be bundled into the GitHub repository with all of the code.

With a larger data set, I guess the thing to do is recognize, from the start, that the data are going to be sitting elsewhere. So then, I think you should organize the data in the form that you expect to make public, and work from those files.

When you write a script to convert data from one form to another, also write some tests, to make sure that it worked correctly.
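One simple kind of test is a round trip: convert the data to the public form, convert it back, and check that you get the original files again. Here is a minimal sketch in shell. The converters to_public and from_public are hypothetical stand-ins (tab-delimited to comma-delimited and back) for the real scripts, and they assume that no field contains a tab or a comma.

```shell
# Hypothetical converters, stand-ins for the real conversion scripts:
# the "working" form is tab-delimited, the "public" form comma-delimited.
#   $1 = input file, $2 = output file
to_public()   { tr '\t' ',' < "$1" > "$2"; }
from_public() { tr ',' '\t' < "$1" > "$2"; }

# Round-trip check: convert to the public form, convert back, and compare
# the result with the original, byte for byte.
# Returns 0 if they match, nonzero otherwise.
roundtrip_ok() {
  tmpdir=$(mktemp -d) || return 1
  to_public "$1" "$tmpdir/public.csv" &&
    from_public "$tmpdir/public.csv" "$tmpdir/back.tsv" &&
    cmp -s "$1" "$tmpdir/back.tsv"
  rc=$?
  rm -rf "$tmpdir"
  return "$rc"
}
```

A file with a comma inside a field would fail this check, which is the sort of conversion bug that a round trip can catch before the data are posted.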

And then document, document, document! As with software development, it’s hard to document data or analyses after the fact.