The stupidest R code ever

Let me tell you about my stupidest R mistake.

In the R package that I write, R/qtl, one of the main file formats is a comma-delimited file, where the blank cells in the second row are important, as they distinguish the initial phenotype columns from the genetic marker columns.

I’d gotten some reports that if there were many phenotypes, the import of such a file could take an extremely long time. I ignored the problem (as it wasn’t a problem for me), but eventually it did become a problem for me, and when I investigated, I found the following code.

# determine number of phenotypes based on initial blanks in row 2
n <- ncol(data)
temp <- rep(FALSE,n)
for(i in 1:n) {
  temp[i] <- all(data[2,1:i]=="")
  if(!temp[i]) break
}
if(!any(temp)) # no phenotypes!
  stop("You must include at least one phenotype (e.g., an index).")
n.phe <- max((1:n)[temp])

Here data is the input matrix, and I use a for loop over columns, looking for the first cell for which all preceding cells were empty. If you can understand the code above, I’m sure you’ll agree that it is really stupid. I think the code was in the package for at least five years, possibly even eight.

For a file with 200 individuals and 1500 phenotypes, it would take about 60 seconds to load; after the fix (below), it took about 2 seconds. I spent 58 seconds looking for the first non-blank cell in the second row!

In April, 2009, I fixed it (see the commit at the github repository) by replacing the above with the following.

if(data[2,1] != "")
  stop("You must include at least one phenotype (e.g., an index).")
n.phe <- min(which(data[2,] != ""))-1

If you don’t quite understand what I’m talking about, here’s a picture of one of these comma-delimited files; this one has three phenotypes.

Open source means everyone can see my stupid mistakes. Version control means everyone can see every stupid mistake I’ve ever made.

About these ads

Tags: , ,

4 Responses to “The stupidest R code ever”

  1. Tom Says:

    The last line is my favorite. Is there any work on the humility incurred by open source?

    Although my mistakes are far more gruesome, my main problem is I write a script, learn as I go, and by the time I reach the end I realize the beginning was written by a much more inept version of me. I could start again refining, but then wouldn’t the process go into eternity?

  2. Tianfu Says:

    This article reminds me of a little question. Could you take a look?
    In the nearest version, R/qtl 1.21-2, I find that the part you mentioned in this article is:

    empty <- grep("^\\s*$", data[2,])
    n.phe <- min((1:ncol(data))[-empty])-1

    I think maybe it can be like this:

    empty <- grep("^\\s*$", data[2,])
    n.phe <- max(-empty)

    Is there any difference?

    • Karl Broman Says:

      There is a difference, and the latter wouldn’t work.

      empty might be something like c(1, 2, 25), with ncol(data)==150, in which case max(-empty) would give -1, but we’d want 2.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 92 other followers