Archive for the ‘Programming’ Category

Write unit tests!

7 Dec 2015

Since 2000, I’ve been working on R/qtl, an R package for mapping the genetic loci (called quantitative trait loci, QTL) that contribute to variation in quantitative traits in experimental crosses. The Bioinformatics paper about it is my most cited; also see my 2014 JORS paper, “Fourteen years of R/qtl: Just barely sustainable.”

It’s a bit of a miracle that R/qtl works and gives the right answers, as it includes essentially no formal tests. The only regular tests are that the examples in the help files don’t produce any errors that halt the code.

I’ve recently been working on R/qtl2, a reimplementation of R/qtl to better handle high-dimensional data and more complex crosses, such as Diversity Outbred mice. In doing so, I’m trying to make use of the software engineering principles that I’ve learned over the last 15 years, which pretty much correspond to the ideas in “Best Practices for Scientific Computing” (Greg Wilson et al., PLOS Biology 12(1): e1001745, doi:10.1371/journal.pbio.1001745).

I’m still working on “Make names consistent, distinctive, and meaningful”, but I’m doing pretty well on writing shorter functions with less repeated code, and particularly importantly I’m writing extensive unit tests.

Fitting linear mixed models for QTL mapping

24 Nov 2015

Linear mixed models (LMMs) have become widely used for dealing with population structure in human GWAS, and they’re becoming increasing important for QTL mapping in model organisms, particularly for the analysis of advanced intercross lines (AIL), which often exhibit variation in the relationships among individuals.

In my efforts on R/qtl2, a reimplementation R/qtl to better handle high-dimensional data and more complex cross designs, it was clear that I’d need to figure out LMMs. But while papers explaining the fit of LMMs seem quite explicit and clear, I’d never quite turned the corner to actually seeing how I’d implement it. In both reading papers and studying code (e.g., lme4), I’d be going along fine and then get completely lost part-way through.

But I now finally understand LMMs, or at least a particular, simple LMM, and I’ve been able to write an implementation: the R package lmmlite.

It seemed worthwhile to write down some of the details.


Session info from R/Travis

25 Sep 2015

For the problem I reported yesterday, in which my R package was working fine locally but failing on Travis, the key solution is to run update.packages(ask=FALSE) locally, and maybe even update.packages(ask=FALSE, type="source") to be sure to grab the source of packages for which binaries are not yet available. I now know to do that.

In addition, it’d be useful to have session information (R and package versions) in the results from Travis. This has proven a bit tricky.

If you don’t want to go with a fully custom Travis script, your customization options are limited. We really only care about the case of a failure, so after_success is not of interest, and after_script seems not to be run if there’s a Travis fail. Moreover, script and after_failure are defined by the main language: r script, so you can’t change them without going all-custom.

What’s left is before_script.

I want to see the result of devtools::session_info() with the package of interest loaded, but the package actually gets built after before_script is run, so we’ll need to build and install it, even though it’ll be built and installed again afterwards. The best I could work out is in this example .travis.yml file, with the key bits being:

  - export PKG_NAME=$(Rscript -e 'cat(paste0(devtools::as.package(".")$package))')
  - export PKG_TARBALL=$(Rscript -e 'pkg <- devtools::as.package("."); cat(paste0(pkg$package,"_",pkg$version,".tar.gz"))')
  - R CMD build --no-build-vignettes .
  - rm ${PKG_TARBALL}
  - echo "Session info:"
  - Rscript -e "library(${PKG_NAME});devtools::session_info('${PKG_NAME}')"

I use --no-build-vignettes in R CMD build as otherwise the package would be built and installed yet another time. And I remove the .tar.gz file afterwards, to avoid having the later check complain about the extra file.

Here’s an example of the session info in the Travis log.

If you have suggests about how to simplify this, I’d be happy to hear them. I guess the key would be to have the main Travis script for R revised to report session information.

Thanks to Jenny Bryan for showing me how to search for instances of session_info in .travis.yml files on GitHub, and to Carson Sievert for further moral support.

It’s not you, it’s me

24 Sep 2015

Somehow when my code stops working, my first (and second, and third) reaction is to blame everything except my own code. (“It’s not me, it’s you.”)

And almost always, it’s my own code that’s the problem (hence the title of this post).

I spent the day trying to resolve a bug in my early-in-development R package, qtl2geno. In the process, I blamed

  • TravisCI for not handling system.file() properly.
  • R-devel for having broken system.file().
  • data.table::fread() for treating sep=NULL differently on different operating systems.

Of course, none of these were true. I was just passing sep=NULL to data.table::fread(), and that worked in the previous version, but doesn’t work in the latest release on CRAN, and I hadn’t yet installed the latest version of data.table on my Mac, but Travis and my junky Windows laptop had the latest version.

The debugging process seems a potentially interesting case study, so I thought I’d write down some of the details.


MongoDB with D3.js

22 Jun 2015

I consider interactive data visualization to be the critical tool for exploration of high-dimensional data.

That’s led me to spend a good amount of time in the last few years learning some new skills (D3 and CoffeeScript) and developing some new tools, particularly the R package R/qtlcharts, which provides interactive versions of the many data visualizations in R/qtl, my long-in-development R package for mapping genetic loci (called quantitative trait loci, QTL) that underlie complex trait variation in experimental organisms.

R/qtlcharts is rough in spots, and while it works well for moderate-sized data sets, it can’t well handle truly large-scale data, as it just dumps all of the data into the file viewed by a web browser.

For large-scale data, one needs to dynamically load slices of the data based on user interactions. It seems best to have a formal database behind the scenes. But I think I’m not unusual, among statisticians, in having almost no experience working with databases. My collaborators tend to keep things in Excel. Even for quite large problems, I keep things in flat files.

So, I’ve been trying to come to understand the whole database business, and how I might use one for larger-scale data visualizations. I think I’ve finally made that last little conceptual step, where I can see what I need to do. I made a small illustration in my d3examples repository on GitHub.


Cheat sheets for R-based Software Carpentry course

29 Apr 2015

At the Software Carpentry workshop at UW-Madison in August, 2014, one of the students suggested that we hand out some cheat sheets on each topic. I thought that was a really good idea.

So at the SWC workshop at Washington State University this week, we handed out the following five pages:

I really appreciate the work (and design sense) that were put into these.

Yet another R package primer

28 Aug 2014

Hadley Wickham is writing what will surely be a great book about the basics of R packages. And Hilary Parker wrote a very influential post on how to write an R package. So it seems like that topic is well covered.

Nevertheless, I’d been thinking for some time that I should write another minimal tutorial with an alliterative name, on how to turn R code into a package. And it does seem valuable to have a diversity of resources on such an important topic. (R packages are the best way to distribute R code, or just to keep track of your own personal R code, as part of a reproducible research process.)

So I’m going ahead with it, even though it doesn’t seem necessary: the R package primer.

It’s not completely done, but the basic stuff is there.

Testing an R package’s interactive graphs

1 Aug 2014

I’ve been working on an R package, R/qtlcharts, with D3-based interactive graphs for quantitative trait locus mapping experiments.

Testing the interactive charts it produces is a bit of a pain. It seems like I pretty much have to just open a series of examples in a web browser and tab through them manually, checking that they look okay, that the interactions seem to work, and that they’re not giving any sort of errors.

But if I want to post the package to CRAN, it seems (from the CRAN policy) that the examples in the .Rd files shouldn’t be opening a web browser. Thus, I need to surround the example code with \dontrun{}.

But I was using those examples, and R CMD check, to open the series of examples for manual checking.

So, what I’ve decided to do:

  • Include examples opening a browser, but within \dontrun{} so the browser isn’t opened in R CMD check.
  • Also include examples that don’t open the browser, within \dontshow{}, so that R CMD check will at least check the basics.
  • Write a ruby script that pulls out all of the examples from the .Rd files, stripping off the \dontrun{} and \dontshow{} and pasting it all into a .R file.
  • Periodically run R CMD BATCH on that set of examples, to do the manual checking of the interactive graphs.

This will always be a bit of a pain, but with this approach I can do my manual testing in a straightforward way and still fulfill the CRAN policies.

Update: Hadley Wickham pointed me to \donttest{}, added in R ver 2.7 (in 2008). (More value from blog + twitter!)

So I replaced my \dontrun{} bits with \donttest{}. And I can use devtools::run_examples() to run all of the examples, for my manual checks.

Why hadn’t I written a function for that?

16 Jul 2014

I’m often typing the same bits of code over and over. Those bits of code really should be made into functions.

For example, I’m still using base graphics. (ggplot2 is on my “to do” list, really!) Often some things will be drawn with a slight overlap of the border of the plotting region. And in heatmaps with image, the border is often obscured. I want a nice black rectangle around the outside.

So I’ll write the following:

u <- par("usr")
rect(u[1], u[3], u[2], u[4])

I don’t know how many times I’ve typed that! Today I realized that I should put those two lines in a function add_border(). And then I added add_border() to my R/broman package.

It was a bit more work adding the Roxygen2 comments for the documentation, but now I’ve got a proper function that is easier to use and much more clear.

Update: @tpoi pointed out that box() does the same thing as my add_border(). My general point still stands, and this raises the additional point: twitter + blog → education.

I want to add, “I’m an idiot” but I think I’ll just say that there’s always more that I can learn about R. And I’ll remove add_border from R/broman and just use box().

Simple web sites with GitHub Pages

5 Apr 2014

If you love git and GitHub, you’ll also love GitHub Pages, for producing websites using Markdown and git: you write pages in Markdown within a gh-pages branch in a git repository; when you push to GitHub, a corresponding site is automatically constructed.