Fitting linear mixed models for QTL mapping

24 Nov 2015

Linear mixed models (LMMs) have become widely used for dealing with population structure in human GWAS, and they’re becoming increasingly important for QTL mapping in model organisms, particularly for the analysis of advanced intercross lines (AIL), which often exhibit variation in the relationships among individuals.

In my efforts on R/qtl2, a reimplementation of R/qtl to better handle high-dimensional data and more complex cross designs, it was clear that I’d need to figure out LMMs. But while papers explaining how to fit LMMs seem quite explicit and clear, I’d never quite turned the corner to actually seeing how I’d implement it. In both reading papers and studying code (e.g., lme4), I’d be going along fine and then get completely lost part-way through.

But I now finally understand LMMs, or at least a particular, simple LMM, and I’ve been able to write an implementation: the R package lmmlite.

It seemed worthwhile to write down some of the details.
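To give the flavor: the simple model in question has the form y = Xβ + u + ε, with a polygenic effect u ~ N(0, σ²_g K) for a known kinship matrix K and residual ε ~ N(0, σ²_e I). Below is a rough sketch, in R, of the usual eigendecomposition trick for fitting such a model: rotate the data by the eigenvectors of K so that the overall covariance becomes diagonal, then profile the log likelihood over the heritability. This is purely to illustrate the idea; it’s not lmmlite’s actual interface, and the function name is made up.

# Sketch of the eigendecomposition approach for a one-random-effect LMM.
# Illustration only; not lmmlite's interface.
fit_lmm_sketch <- function(y, X, K)
{
    e <- eigen(K, symmetric=TRUE)

    # rotate the data so that the combined covariance becomes diagonal
    yr <- drop(crossprod(e$vectors, y))
    Xr <- crossprod(e$vectors, X)

    # profile -2*log-likelihood in h2 = sg2/(sg2 + se2), up to a constant
    negll <- function(h2) {
        w <- h2 * e$values + (1 - h2)    # diagonal variances, up to scale
        fit <- lm.wfit(Xr, yr, w=1/w)    # generalized least squares
        rss <- sum(fit$residuals^2 / w)  # weighted residual sum of squares
        sum(log(w)) + length(yr) * log(rss)
    }

    optimize(negll, c(0, 0.999))         # MLE of the heritability
}

Here X holds an intercept plus any covariates; given the estimated heritability, the variance components and coefficients follow by plugging back in.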


Session info from R/Travis

25 Sep 2015

For the problem I reported yesterday, in which my R package was working fine locally but failing on Travis, the key solution is to run update.packages(ask=FALSE) locally, and maybe even update.packages(ask=FALSE, type="source") to be sure to grab the source of packages for which binaries are not yet available. I now know to do that.

In addition, it’d be useful to have session information (R and package versions) in the results from Travis. This has proven a bit tricky.

If you don’t want to go with a fully custom Travis script, your customization options are limited. We really only care about the case of a failure, so after_success is not of interest, and after_script seems not to be run if there’s a Travis failure. Moreover, script and after_failure are defined by Travis’s built-in R support (language: r), so you can’t change them without going fully custom.

What’s left is before_script.

I want to see the result of devtools::session_info() with the package of interest loaded, but the package actually gets built after before_script is run, so we’ll need to build and install it, even though it’ll be built and installed again afterwards. The best I could work out is in this example .travis.yml file, with the key bits being:

  - export PKG_NAME=$(Rscript -e 'cat(paste0(devtools::as.package(".")$package))')
  - export PKG_TARBALL=$(Rscript -e 'pkg <- devtools::as.package("."); cat(paste0(pkg$package,"_",pkg$version,".tar.gz"))')
  - R CMD build --no-build-vignettes .
  - R CMD INSTALL ${PKG_TARBALL}
  - rm ${PKG_TARBALL}
  - echo "Session info:"
  - Rscript -e "library(${PKG_NAME});devtools::session_info('${PKG_NAME}')"

I use --no-build-vignettes in R CMD build as otherwise the package would be built and installed yet another time. And I remove the .tar.gz file afterwards, to avoid having the later check complain about the extra file.

Here’s an example of the session info in the Travis log.

If you have suggestions about how to simplify this, I’d be happy to hear them. I guess the key would be to have the main Travis script for R revised to report session information.

Thanks to Jenny Bryan for showing me how to search for instances of session_info in .travis.yml files on GitHub, and to Carson Sievert for further moral support.

It’s not you, it’s me

24 Sep 2015

Somehow when my code stops working, my first (and second, and third) reaction is to blame everything except my own code. (“It’s not me, it’s you.”)

And almost always, it’s my own code that’s the problem (hence the title of this post).

I spent the day trying to resolve a bug in my early-in-development R package, qtl2geno. In the process, I blamed

  • TravisCI for not handling system.file() properly.
  • R-devel for having broken system.file().
  • data.table::fread() for treating sep=NULL differently on different operating systems.

Of course, none of these were true. I was just passing sep=NULL to data.table::fread(); that had worked in the previous version, but doesn’t work in the latest release on CRAN. I hadn’t yet installed the latest version of data.table on my Mac, but Travis and my junky Windows laptop had it.
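For what it’s worth, a guard like the following would have avoided the whole mess; this is just a sketch of the idea, not necessarily the fix I made in qtl2geno.

# Only pass `sep` to fread() when a separator was actually specified;
# otherwise let fread's default separator detection do its thing.
read_table_file <- function(file, sep=NULL)
{
    if(is.null(sep)) {
        data.table::fread(file)
    } else {
        data.table::fread(file, sep=sep)
    }
}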

The debugging process seems a potentially interesting case study, so I thought I’d write down some of the details.


Reproducibility is hard

9 Sep 2015

Reproducibility is hard. It will probably always be hard, because it’s hard keeping things organized.

I recently had a paper accepted at G3, concerning a huge set of sample mix-ups in a large eQTL study. I’d discovered and worked out the issue back in December, 2010. I gave a talk about it at the Mouse Genetics meeting in Washington, DC, in June, 2011. But for reasons that I will leave unexplained, I didn’t write it up until much later. I did the bulk of the writing in October, 2012, but it wasn’t until February, 2014, that I posted a preprint at arXiv, which I then finally submitted to G3 in June this year.

In writing up the paper in late 2012, I re-did my entire analysis from scratch, to make the whole thing more cleanly reproducible. So with the paper now in press, I’ve placed all of that in a GitHub repository, but as it turned out, there was still a lot more to do. (You can tell, from the repository, that this is an old project, because there are a couple of Perl scripts in there. It’s been a long time since I switched from Perl to Python and Ruby. I still can’t commit to just one of Python or Ruby: I want to stick with Python, as everyone else is using it, but I much prefer Ruby.)

The basic issue is that the raw data are about 1 GB. The clean version of the data is another 1 GB. And then there are the results of various intermediate calculations, some rather slow to compute, which take up another 100 MB. I can’t reasonably put all of that within the GitHub repository.

Both the raw and clean data have been posted in the Mouse Phenome Database. (Thanks to Petr Simecek, Gary Churchill, Molly Bogue, and Elissa Chesler for that!) But the data are in a form that I thought suitable for others, not quite the form in which I’d actually been using them.

So, I needed to write a script that would grab the data files from MPD and reorganize them in the way that I’d been using them.

In working on that, I discovered some mistakes in the data posted to MPD: there were a couple of bugs in my code to convert the data from the format I was using into the format I was going to post. (So it was good to spend the time on the script that did the reverse!)

In addition to the raw and clean data on MPD, I posted a zip file with the 110 MB of intermediate results on figshare.

In the end, I’m hoping that one can clone the GitHub repository and just run make and it will download the data and do all of the stuff. If you want to save some time, you could download the zip file from figshare and unzip that, and then run make.

I’m not quite there, but I think I’m close.

Aspects I’m happy with

For the most part, my work on this project wasn’t terrible.

  • I wrote an R package, R/lineup, with the main analysis methods.
  • Re-deriving the entire analysis cleanly, in a separate, reproducible document (I used AsciiDoc and knitr), was a good thing.
  • The code for the figures and tables is all reasonably clean, and draws from either the original data files or from intermediate calculations produced by the AsciiDoc document.
  • I automated everything with GNU Make.

What should I have done differently?

There was a lot more after-the-fact work than I would have liked.

Making a project reproducible is easier if the data aren’t that large and so can be bundled into the GitHub repository with all of the code.

With a larger data set, I guess the thing to do is to recognize, from the start, that the data are going to sit elsewhere. Then organize the data in the form you expect to make public, and work from those files.

When you write a script to convert data from one form to another, also write some tests, to make sure that it worked correctly.
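For example, when the conversion goes in both directions, a round-trip test is cheap. A sketch (the file and function names here are hypothetical):

# Round-trip check: convert to the public format and back again, then
# verify that the original data are recovered exactly.
# convert_to_mpd() and convert_from_mpd() are hypothetical converters.
orig <- read.csv("data_in_original_format.csv", as.is=TRUE)
roundtrip <- convert_from_mpd(convert_to_mpd(orig))
stopifnot(identical(orig, roundtrip))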

And then document, document, document! As with software development, it’s hard to document data or analyses after the fact.

God-awful conference websites

5 Aug 2015

What do I want in a conference website? Not this.

  • I want to be able to browse sessions to find the ones I’m interested in. That means being able to see the session title and time as well as the speakers and talk titles. A super-long web page is perfectly fine.
  • If you can’t show me everything at once, at least let me click-to-expand: for the talk titles, and then for the abstracts. Otherwise I have to keep clicking and going back.
  • I want to be able to search for people. And if I’m searching for Hao Wu, I don’t want to look at all of the Wus. Or all of the Haos. I just want the Hao Wus. If I can’t search on "Hao Wu", at least let me search on "Wu, Hao".
  • If my search returns nothing and I go back, bring me back to the same search form. Don’t make me have to click “Search for people” again.
  • I’d like to be able to form a schedule of the sessions to attend. (JSM2015 does that okay, but it’s not what I’d call “secure” and you have to find the damned things, first.) Really, I want to pick particular talks: this one in that session and that one in the other. But yeah, that seems a bit much to ask.

The JSM 2015 site is so terrible for browsing, I was happy to get the pdf of the program. (Good luck finding it on the website on your own; ASA tweeted the link to me, due to my bitching and moaning.) You can browse the pdf. That’s the way I ended up finding the sessions I wanted to attend. It also had an ad for the JSM 2015 mobile app. Did you know there was one? Good luck finding a link to that on their website, either.

The pdf is usable, but much like the website, it fails to make use of the medium. I want:

  • Bookmarks. I want to jump to where Monday’s sessions start without having to flip through the whole thing.
  • Hyperlinks. If you don’t include the abstracts, with links from the talk titles to the abstracts, at least include links to the web page that has the abstract so I don’t have to search on the web.
  • More hyperlinks. The pdf has an index, with people and page numbers. Why not link those page numbers to the corresponding page?

I helped organize a small meeting in 2013. The program on the web and the corresponding pdf illustrate much of what I want. (No scheduling feature, but that meeting had no simultaneous sessions.) I included gratuitous network graphs of the authors and abstracts. It’s 2015. No conference site is truly complete without interactive network graphs.


Update: As Thomas Lumley commented below, if you search on “Wu” you get all of the “Wu”s, plus one “Wulfhorst”. And if you search on “Hao” you get only people whose last name is “Hao”.

He further pointed out that if you search for the affiliation “Auckland” the results don’t include “University of Auckland” but only “Auckland University of Technology”. And actually, if you search for “University of Auckland” you get nothing. You need to search for “The University of Auckland”.

Memories of past JSMs

3 Aug 2015

The Joint Statistical Meetings (JSM) are the big statistics meetings in North America, “joint” among the American Statistical Association, Institute of Mathematical Statistics, International Biometric Society (ENAR and WNAR), Statistical Society of Canada, and others.

JSM 2015 is next week, in Seattle. In anticipation, I thought I’d write down some of my main memories of past JSMs.

MongoDB with D3.js

22 Jun 2015

I consider interactive data visualization to be the critical tool for exploration of high-dimensional data.

That’s led me to spend a good amount of time in the last few years learning some new skills (D3 and CoffeeScript) and developing some new tools, particularly the R package R/qtlcharts, which provides interactive versions of the many data visualizations in R/qtl, my long-in-development R package for mapping genetic loci (called quantitative trait loci, QTL) that underlie complex trait variation in experimental organisms.

R/qtlcharts is rough in spots, and while it works well for moderate-sized data sets, it can’t handle truly large-scale data well, as it just dumps all of the data into the file viewed by the web browser.

For large-scale data, one needs to dynamically load slices of the data based on user interactions. It seems best to have a formal database behind the scenes. But I think I’m not unusual, among statisticians, in having almost no experience working with databases. My collaborators tend to keep things in Excel. Even for quite large problems, I keep things in flat files.

So, I’ve been trying to come to understand the whole database business, and how I might use one for larger-scale data visualizations. I think I’ve finally made that last little conceptual step, where I can see what I need to do. I made a small illustration in my d3examples repository on GitHub.
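To give a taste of the concept in R terms (this isn’t the code in my d3examples illustration, and the collection name and query here are made up):

# Keep the full dataset in MongoDB, and have the visualization request
# just the slice it needs, rather than dumping everything into the page.
library(mongolite)
m <- mongo(collection="expression", db="qtl", url="mongodb://localhost")

# e.g., pull a single gene's expression values when the user clicks on it
slice <- m$find('{"gene": "Slc25a21"}')

The point is the query: each user interaction triggers a small, targeted request, so the browser never has to hold the full dataset.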


Randomized Hobbit

22 Jun 2015

@wrathematics pointed me to his ngram R package, for constructing n-grams from text and simulating from them.

I’d recently grabbed the text of The Hobbit, and so I applied the package to that text, with amusing results.

Here’s the code I used to grab the text.

library(XML)   # for htmlTreeParse() and xpathApply()

stem <- ""     # base URL omitted
hobbit <- NULL
for(i in 1:74) {
    if(i==1) {
        url <- paste0(stem, ".html")
    } else {
        url <- paste0(stem, "_", i, ".html")
    }

    x <- htmlTreeParse(url, useInternalNodes=TRUE)
    xx <- xpathApply(x, "//p", xmlValue)
    hobbit <- c(hobbit, gsub("\r", "", xx[-length(xx)]))
}

Then calculate the ngrams with n=2.

library(ngram)
ng2 <- ngram(hobbit, n=2)

Simulate some number of words with babble(). If you use the seed argument, you can get the result reproducibly.

babble(ng2, 48, seed=53482175)

into trees, and then bore to the Mountain to go through?” groaned the hobbit. “Well, are you doing, And where are you doing, And where are you?” it squeaked, as it was no answer. They were surly and angry and puzzled at finding them here in their holes

Update: @wrathematics suggested that I mix two texts, so here’s a bit from the Hobbit in the Hat (The Hobbit with 59× Cat in the Hat, up-sampled to match lengths). But there’s maybe not enough overlap between the two texts to get much of a mixture.
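The construction was roughly along these lines (a sketch; cat_in_hat stands for a hypothetical character vector containing the text of Cat in the Hat):

# Mix the two texts: repeat Cat in the Hat 59 times so the lengths
# roughly match, then construct 2-grams from the combination and babble.
mixed <- c(hobbit, rep(cat_in_hat, 59))   # cat_in_hat is hypothetical
ng2_mixed <- ngram(mixed, n=2)
babble(ng2_mixed, 48)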

“I am Gandalf,” said the fish. This is no way at all!

already off his horse and among the goblin and the dragon, who had remained behind to guard the door. “Something is outside!” Bilbo’s heart jumped into his boat on to sharp rocks below; but there was a good game, Said our fish No! No! Those Things should not fly.

Cheat sheets for R-based Software Carpentry course

29 Apr 2015

At the Software Carpentry workshop at UW-Madison in August, 2014, one of the students suggested that we hand out some cheat sheets on each topic. I thought that was a really good idea.

So at the SWC workshop at Washington State University this week, we handed out five such cheat sheets, one page per topic.

I really appreciate the work (and design sense) that went into these.

2015 AAAS in San Jose

13 Feb 2015

I’m at the 2015 AAAS meeting in San Jose, California. This is definitely not my typical meeting: too big, too broad, and I hardly know anyone here. But here’s a quick (ha ha; yah sure) summary of the meeting so far.

Opening night

Gerald Fink gave the President’s Address last night. He’s the AAAS President, so I guess that’s appropriate. But after five minutes of really lame simplistic crap (for example, he said something like, “A single picture can destroy our known understanding of the universe,” like innovation and improving our understanding is a bad thing), I left.

Oh, and before that: the emcee of the evening, who introduced Janet Napolitano, totally couldn’t pronounce her last name. (Her remarks, particularly her comments in support of public universities, were quite powerful.) Old dude: practice such things! Your ineptness reveals that you haven’t paid proper attention to her.


Running into people

A huge meeting, but I know next to no one here. Still, I ran into Sanjay Shete in the exhibit hall, where I attempted to get two of every tchotchke. (My kids will pitch a fit if one gets something, no matter how lame the thing, and the other doesn’t.) Sanjay was named an AAAS Fellow; that’s why he’s here.

I also ran into Steve Goodman (not the folk singer who died too young, but a singer, nevertheless). Gotta love Steve Goodman! He produced Behind the tan door.


The talks

I went to a dozen talks. A half-dozen I really liked.

Alan Aspuru-Guzik talked about how to find (and visualize) useful organic molecules among the 10^60 (or 10^180?) possible. Cool high-throughput computing and interactive graphics to produce better solar panels (particularly for developing countries) and huge batteries to store wind- and solar-based power.

Russ Altman talked about how to search databases, web-search histories, and social media, to identify pairs of drugs that, together, give bad (or good) side effects that wouldn’t be predicted from their on-their-own side effects.

David Altshuler had a hilarious outline slide for his talk, but the rest was really awesome. A key point: to develop precision medicine will require hard work and there’s no magic bullet. And basic (not just translational) research is critical: we can’t make a medicine that gets to the precise cause (and that’s what precision medicine is about) if we don’t understand that basic biology.

I gave a talk myself, in a session on visualization of biomedical data, but it was definitely not the best talk in the session, nor the second best. Mine might have been the worst of the five talks in the session. But that’s okay; I think I did fine. It’s just that Sean Hanlon (brother of my UW–Madison colleague, Bret Hanlon) put together a superb, but thinly-attended, session.

Miriah Meyer’s was my favorite talk of the day. She develops visualization tools to help scientists make sense of their data. And her approach is much like mine: specific solutions to specific data and questions. She talked about MulteeSum, PathLine, and MizBee. Favorite quote: “It’s amazing how much people like circles these days.”

Frederick Streitz from Lawrence Livermore National Lab talked about simulating and visualizing the electrophysiology of the human heart at super-high resolution, using a frigging huge cluster with 1.5 million cores. I loved his analogies: if you’re painting your house, having a friend or two help will reduce the time by the expected factor, but what about 1000 friends, or 100k? At that scale of parallelism, you need to rethink what you’ll use the computers for.

His second analogy: the DOE cluster at Livermore is 100k times a desktop computer. That’s like the difference between PacMan (1980, 2.1 megaflops) and Assassin’s Creed (2011, 260 gigaflops). And their cluster is 100k times that.

At the end of the day, Daphne Koller talked about Coursera. She’s awesome; Coursera’s awesome; I’m a crappy teacher. That’s my thinking at the moment, anyway. (A video of her talk is online. Have I mentioned how much I hate it when people screw up the aspect ratio? It seems like they screwed up the aspect ratio.) University faculty exist to help people, and with Coursera and other MOOCs, we can help a lot of people. Key lessons: the value of peer grading (for learning), not being constrained by the classroom or the 60-minute format, the ability to explore possible teaching innovations, and just having a hugely broad reach.

I don’t think I’d heard the quote that Daphne mentioned, attributed to Edwin Emery Slosson:

College is a place where a professor’s lecture notes go straight to the students’ lecture notes, without passing through the brains of either.

Still laughing!


Tacos

As I mentioned on twitter, I’ve eaten a lot of tacos. But I also had some donuts.

Boy, am I old

I seem to be staying at the same hotel as the American Junior Academy of Sciences (AJAS). Are these high school or college students? Man, do I feel old.

My contribution to education, today: if all of the elevators going down are too packed to accept passengers, press the up button and ride it up and then down. (Later I learned, from one of the AJAS youth, that the “alarm will sound” sign at the bottom of the stairs is a lie. You can take the stairs.)

