Archive for August, 2011

Quick labels within figures

26 Aug 2011

One of the coolest R packages I heard about at the useR! Conference: Toby Dylan Hocking’s directlabels package for putting labels directly next to the relevant curves or point clouds in a figure.

I think I first learned about this idea from Andrew Gelman: that a separate legend requires a lot of back-and-forth glances, so it’s better to put the labels right by the relevant bits. For example, like this:
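Usage is compact. A minimal sketch (assuming ggplot2 alongside directlabels; the built-in iris data and the aesthetics here are just for illustration, not from the talk):

```r
library(ggplot2)
library(directlabels)

# scatterplot of three species; direct.label() places each species name
# next to its point cloud instead of drawing a separate legend
p <- ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point()
direct.label(p)
```

The package chooses a sensible label position for the plot type; you can also pass a positioning method as a second argument.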


useR! Conference 2011 highlights

20 Aug 2011

I was at the useR! Conference at The University of Warwick in Coventry, UK, last week. My goal in going was to learn the latest things regarding (simple) dynamic graphics, (simple) web-based apps, parallel computing, and memory management (dealing with big data sets). I got just what I was hoping for and more. There are a lot of useful tools available that I want to adopt. I’ll summarize the high points below, with the particular areas of interest to me covered more exhaustively than just “highlights”.

I left feeling that my programming skills are crap. My biggest failing is in not making sufficient use of others’ packages, but rather just building what I need from scratch (with great effort) and skipping dynamic graphics completely.


There were 440 participants from 41 countries (342 Europe; 60 North America).

Prof. Brian Ripley spoke about the R development process.

  • There are now >3000 packages on CRAN, with 110 submissions per week (of which 80 are successful), basically all handled by Kurt Hornik.
  • CRAN will throw out binaries of packages that are more than two years old.
  • What’s within the base of R will shrink rather than grow.
  • There have been a lot of improvements in the rendering of graphics.
  • R is heavily dependent on a small number of altruistic developers, many of whom feel their contributions are not treated with respect.
  • library() is to be replaced by use().
  • There will soon be a parallel package for parallel computing.
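For reference, that parallel package did arrive shortly afterward, bundled with base R as of version 2.14.0. A minimal sketch of its cluster interface (the worker count and the toy task here are arbitrary):

```r
library(parallel)

# start two worker processes, square the numbers 1..4 across them,
# then shut the workers down
cl <- makeCluster(2)
squares <- parLapply(cl, 1:4, function(x) x^2)
stopCluster(cl)

unlist(squares)   # 1 4 9 16
```

On unix-alikes, `mclapply()` offers the same idea via forked processes, with no explicit cluster setup.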


The stupidest R code ever

17 Aug 2011

Let me tell you about my stupidest R mistake.

In the R package that I write, R/qtl, one of the main file formats is a comma-delimited file, where the blank cells in the second row are important, as they distinguish the initial phenotype columns from the genetic marker columns.
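As an illustration (a made-up fragment, not from a real data set), such a file might begin like this; the blank cells in row 2 mark the first two columns as phenotypes, while the marker columns carry chromosome identifiers:

```
id,weight,D1M1,D1M2
,,1,1
101,45.3,A,H
102,51.0,B,A
```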

I’d gotten some reports that if there were many phenotypes, the import of such a file could take an extremely long time. I ignored the problem (as it wasn’t a problem for me), but eventually it did become a problem for me, and when I investigated, I found the following code.

# determine number of phenotypes based on initial blanks in row 2
n <- ncol(data)
temp <- rep(FALSE,n)
for(i in 1:n) {
  temp[i] <- all(data[2,1:i]=="")
  if(!temp[i]) break
}
if(!any(temp)) # no phenotypes!
  stop("You must include at least one phenotype (e.g., an index).")
n.phe <- max((1:n)[temp])

Here data is the input matrix. I use a for loop over the columns, checking at each column whether all cells in row 2 up to that point are empty, and stopping at the first column that isn’t. If you can understand the code above, I’m sure you’ll agree that it is really stupid. I think the code was in the package for at least five years, possibly even eight.

For a file with 200 individuals and 1500 phenotypes, it would take about 60 seconds to load; after the fix (below), it took about 2 seconds. I spent 58 seconds looking for the first non-blank cell in the second row!
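The fix was simply to vectorize the search for the first non-blank cell. A sketch (not necessarily the exact code that went into R/qtl), with a toy data matrix for illustration:

```r
# toy example: 2 phenotype columns, then 2 marker columns
data <- rbind(c("id", "weight", "D1M1", "D1M2"),
              c("",   "",       "1",    "1"),
              c("101", "45.3",  "A",    "H"))

# vectorized: locate the first non-blank cell in row 2 in one pass
first.nonblank <- which(data[2, ] != "")[1]
if(is.na(first.nonblank)) {        # row 2 entirely blank: all phenotypes
  n.phe <- ncol(data)
} else if(first.nonblank == 1) {   # no initial blanks: no phenotypes
  stop("You must include at least one phenotype (e.g., an index).")
} else {
  n.phe <- first.nonblank - 1
}
```

One comparison per column, rather than comparing the entire initial segment again at every step.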



17 Aug 2011

I’m at the useR! Conference in Coventry, UK, this week. It’s been every bit as inspiring, interesting and useful as I’d hoped.

Particularly interesting were the lightning talks: a series of 5-minute presentations with one minute in between, each consisting of 15 slides shown for 20 seconds apiece and advanced automatically. It worked extremely well; more talks should be done this way!

And particularly interesting, among the lightning talks, was one by Tal Galili, who started the R-bloggers blog aggregator. He encouraged the R community to blog. A particularly important point, for me, was his emphasis on not feeling a need to blog at great frequency: even once per year would be worthwhile.

I had a blog in the past, but I felt a constant urge to be posting, and so felt guilty for not posting.  I have enough feelings of guilt in my life, and so I decided to just drop the blog.

Also, my previous effort focused largely on personal matters (particularly my experiences as a new parent).  It perhaps got a bit too personal.  Here, I’m going to focus on less personal things, or at least on things that might embarrass me but aren’t likely to embarrass my family.

And I’ve been thinking about blogging recently. There are a number of blogs that I read regularly and quite enjoy, particularly those of Andrew Gelman, PZ Myers, Steven Salzberg, Jerry Coyne, and Jen McCreight. (Apologies to Prof. Coyne; his is a web site, not a blog.) I have strong opinions and like to share them, but I’m not often asked to; here I can share them freely, without prompting.

So, the topics: statistics (especially graphics and computing), genetics (especially recombination and QTL mapping), programming (particularly R), and academics (the mistakes I’ve made and what I’ve learned).

Since I generally still feel like an amateur in all of these matters (and I like to say: “You know you’re an expert when everyone else’s work looks just as crappy as your own.”), I won’t worry too much about getting everything right. I certainly will want to avoid comparing myself to the people mentioned above.