Archive for the ‘R’ Category

Write unit tests!

7 Dec 2015

Since 2000, I’ve been working on R/qtl, an R package for mapping the genetic loci (called quantitative trait loci, QTL) that contribute to variation in quantitative traits in experimental crosses. The Bioinformatics paper about it is my most cited; also see my 2014 JORS paper, “Fourteen years of R/qtl: Just barely sustainable.”

It’s a bit of a miracle that R/qtl works and gives the right answers, as it includes essentially no formal tests. The only regular tests are that the examples in the help files don’t produce any errors that halt the code.

I’ve recently been working on R/qtl2, a reimplementation of R/qtl to better handle high-dimensional data and more complex crosses, such as Diversity Outbred mice. In doing so, I’m trying to make use of the software engineering principles that I’ve learned over the last 15 years, which pretty much correspond to the ideas in “Best Practices for Scientific Computing” (Greg Wilson et al., PLOS Biology 12(1): e1001745, doi:10.1371/journal.pbio.1001745).

I’m still working on “Make names consistent, distinctive, and meaningful”, but I’m doing pretty well on writing shorter functions with less repeated code, and particularly importantly I’m writing extensive unit tests.

Fitting linear mixed models for QTL mapping

24 Nov 2015

Linear mixed models (LMMs) have become widely used for dealing with population structure in human GWAS, and they’re becoming increasing important for QTL mapping in model organisms, particularly for the analysis of advanced intercross lines (AIL), which often exhibit variation in the relationships among individuals.

In my efforts on R/qtl2, a reimplementation R/qtl to better handle high-dimensional data and more complex cross designs, it was clear that I’d need to figure out LMMs. But while papers explaining the fit of LMMs seem quite explicit and clear, I’d never quite turned the corner to actually seeing how I’d implement it. In both reading papers and studying code (e.g., lme4), I’d be going along fine and then get completely lost part-way through.

But I now finally understand LMMs, or at least a particular, simple LMM, and I’ve been able to write an implementation: the R package lmmlite.

It seemed worthwhile to write down some of the details.


Session info from R/Travis

25 Sep 2015

For the problem I reported yesterday, in which my R package was working fine locally but failing on Travis, the key solution is to run update.packages(ask=FALSE) locally, and maybe even update.packages(ask=FALSE, type="source") to be sure to grab the source of packages for which binaries are not yet available. I now know to do that.

In addition, it’d be useful to have session information (R and package versions) in the results from Travis. This has proven a bit tricky.

If you don’t want to go with a fully custom Travis script, your customization options are limited. We really only care about the case of a failure, so after_success is not of interest, and after_script seems not to be run if there’s a Travis fail. Moreover, script and after_failure are defined by the main language: r script, so you can’t change them without going all-custom.

What’s left is before_script.

I want to see the result of devtools::session_info() with the package of interest loaded, but the package actually gets built after before_script is run, so we’ll need to build and install it, even though it’ll be built and installed again afterwards. The best I could work out is in this example .travis.yml file, with the key bits being:

  - export PKG_NAME=$(Rscript -e 'cat(paste0(devtools::as.package(".")$package))')
  - export PKG_TARBALL=$(Rscript -e 'pkg <- devtools::as.package("."); cat(paste0(pkg$package,"_",pkg$version,".tar.gz"))')
  - R CMD build --no-build-vignettes .
  - rm ${PKG_TARBALL}
  - echo "Session info:"
  - Rscript -e "library(${PKG_NAME});devtools::session_info('${PKG_NAME}')"

I use --no-build-vignettes in R CMD build as otherwise the package would be built and installed yet another time. And I remove the .tar.gz file afterwards, to avoid having the later check complain about the extra file.

Here’s an example of the session info in the Travis log.

If you have suggests about how to simplify this, I’d be happy to hear them. I guess the key would be to have the main Travis script for R revised to report session information.

Thanks to Jenny Bryan for showing me how to search for instances of session_info in .travis.yml files on GitHub, and to Carson Sievert for further moral support.

It’s not you, it’s me

24 Sep 2015

Somehow when my code stops working, my first (and second, and third) reaction is to blame everything except my own code. (“It’s not me, it’s you.”)

And almost always, it’s my own code that’s the problem (hence the title of this post).

I spent the day trying to resolve a bug in my early-in-development R package, qtl2geno. In the process, I blamed

  • TravisCI for not handling system.file() properly.
  • R-devel for having broken system.file().
  • data.table::fread() for treating sep=NULL differently on different operating systems.

Of course, none of these were true. I was just passing sep=NULL to data.table::fread(), and that worked in the previous version, but doesn’t work in the latest release on CRAN, and I hadn’t yet installed the latest version of data.table on my Mac, but Travis and my junky Windows laptop had the latest version.

The debugging process seems a potentially interesting case study, so I thought I’d write down some of the details.


Randomized Hobbit

22 Jun 2015

@wrathematics pointed me to his ngram R package for constructing and simulating from n-grams from text.

I’d recently grabbed the text of the hobbit, and so I applied it to that text, with amusing results.

Here’s the code I used to grab the text.

stem <- ""
hobbit <- NULL
for(i in 1:74) {
    if(i==1) {
        url <- paste0(stem, ".html")
    } else {
        url <- paste0(stem, "_", i, ".html")

    x <- htmlTreeParse(url, useInternalNodes=TRUE)
    xx <- xpathApply(x, "//p", xmlValue)
    hobbit <- c(hobbit, gsub("\r", "", xx[-length(xx)]))

Then calculate the ngrams with n=2.

ng2 <- ngram(hobbit, n=2)

Simulate some number of words with babble(). If you use the seed argument, you can get the result reproducibly.

babble(ng2, 48, seed=53482175)

into trees, and then bore to the Mountain to go through?” groaned the hobbit. “Well, are you doing, And where are you doing, And where are you?” it squeaked, as it was no answer. They were surly and angry and puzzled at finding them here in their holes

Update: @wrathematics suggested that I mix two texts, so here’s a bit from the Hobbit in the Hat (The Hobbit with 59× Cat in the Hat — up-sampled to match lengths.) But there’s maybe not enough overlap between the two texts to get much of a mixture.

“I am Gandalf,” said the fish. This is no way at all!

already off his horse and among the goblin and the dragon, who had remained behind to guard the door. “Something is outside!” Bilbo’s heart jumped into his boat on to sharp rocks below; but there was a good game, Said our fish No! No! Those Things should not fly.

Initial steps towards reproducible research

4 Dec 2014

In anticipation of next week’s Reproducible Science Hackathon at NESCent, I was thinking about Christie Bahlai’s post on “Baby steps for the open-curious.”

Moving from Ye Olde Standard Computational Science Practice to a fully reproducible workflow seems a monumental task, but partially reproducible is better than not-at-all reproducible, and it’d be good to give people some advice on how to get started – to encourage them to get started.

So, I spent some time today writing another of my minimal tutorials, on initial steps towards reproducible research.

It’s a bit rough, and it could really use some examples, but it helped me to get my thoughts together for the Hackathon and hopefully will be useful to people (and something to build upon).

Error notifications from R

4 Sep 2014

I’m enthusiastic about having R notify me when my script is done.

But among my early uses of this, my script threw an error, and I never got a text or pushbullet about that. And really, I’m even more interested in being notified about such errors than anything else.

It’s relatively easy to get notified of errors. At the top of your script, include code like options(error = function() { } )

Fill in the function with your notification code. If there’s an error, the error message will be printed and then that function will be called. (And then the script will halt.)

You can use geterrmessage() to grab the error message to include in your notification.

For example, if you want to use RPushbullet for the notification, you could put, at the top of your script, something like this:

options(error = function() { 
                    pbPost("note", "Error", geterrmessage())

Then if the script gives an error, you’ll get a note with title “Error” and with the error message as the body of the note.

Update: I knew I’d heard about this sort of thing somewhere, but I couldn’t remember where. Duh; Rasmus mentioned it on twitter just a couple of days ago! Fortunately, he reminded me of that in the comments below.

Another update: Ian Kyle pointed out in the comments that the above function, if used in a script run with R CMD BATCH, won’t actually halt the script. The simplest solution is to add stop(geterrmessage()), like this:

options(error = function() { 
                    pbPost("note", "Error", geterrmessage())
                    if(!interactive()) stop(geterrmessage())

Notifications from R

3 Sep 2014

You just sent a long R job running. How to know when it’s done? Have it notify you by beeping, sending you a text, or sending you a notification via pushbullet.


Yet another R package primer

28 Aug 2014

Hadley Wickham is writing what will surely be a great book about the basics of R packages. And Hilary Parker wrote a very influential post on how to write an R package. So it seems like that topic is well covered.

Nevertheless, I’d been thinking for some time that I should write another minimal tutorial with an alliterative name, on how to turn R code into a package. And it does seem valuable to have a diversity of resources on such an important topic. (R packages are the best way to distribute R code, or just to keep track of your own personal R code, as part of a reproducible research process.)

So I’m going ahead with it, even though it doesn’t seem necessary: the R package primer.

It’s not completely done, but the basic stuff is there.

Testing an R package’s interactive graphs

1 Aug 2014

I’ve been working on an R package, R/qtlcharts, with D3-based interactive graphs for quantitative trait locus mapping experiments.

Testing the interactive charts it produces is a bit of a pain. It seems like I pretty much have to just open a series of examples in a web browser and tab through them manually, checking that they look okay, that the interactions seem to work, and that they’re not giving any sort of errors.

But if I want to post the package to CRAN, it seems (from the CRAN policy) that the examples in the .Rd files shouldn’t be opening a web browser. Thus, I need to surround the example code with \dontrun{}.

But I was using those examples, and R CMD check, to open the series of examples for manual checking.

So, what I’ve decided to do:

  • Include examples opening a browser, but within \dontrun{} so the browser isn’t opened in R CMD check.
  • Also include examples that don’t open the browser, within \dontshow{}, so that R CMD check will at least check the basics.
  • Write a ruby script that pulls out all of the examples from the .Rd files, stripping off the \dontrun{} and \dontshow{} and pasting it all into a .R file.
  • Periodically run R CMD BATCH on that set of examples, to do the manual checking of the interactive graphs.

This will always be a bit of a pain, but with this approach I can do my manual testing in a straightforward way and still fulfill the CRAN policies.

Update: Hadley Wickham pointed me to \donttest{}, added in R ver 2.7 (in 2008). (More value from blog + twitter!)

So I replaced my \dontrun{} bits with \donttest{}. And I can use devtools::run_examples() to run all of the examples, for my manual checks.