Archive for August, 2013

“[” and “[[” with the apply() functions

20 Aug 2013

Did you know you can use "[" and "[[" as function names for subsetting with calls to the apply-type functions?

For example, suppose you have a bunch of identifier strings like "ZYY-43S-CWA3" and you want to pull off the bit before the first hyphen ("ZYY" in this case). (For code to create random IDs like that, see the end of this post.)

Suppose the IDs are in a vector of character strings, id.

If I wanted to grab the bit before the first hyphen, I would typically use strsplit and then sapply with function(a) a[1], like so:

sapply(strsplit(id, "-"), function(a) a[1])
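
To see what’s going on with a single ID (just the example string from above):

strsplit("ZYY-43S-CWA3", "-")                             # a list containing c("ZYY", "43S", "CWA3")
sapply(strsplit("ZYY-43S-CWA3", "-"), function(a) a[1])   # gives "ZYY"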

But in place of function(a) a[1], you can use "[", 1, as follows:

sapply(strsplit(id, "-"), "[", 1)

I think that’s kind of cute. You can use "[[" the same way, if you’re working with lists.
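
For example, with a small made-up list (the names and values here are just for illustration):

x <- list(first=1:3, second=4:7, third=8:10)
sapply(x, "[[", 1)              # first element of each component: first=1, second=4, third=8
sapply(x, function(a) a[[1]])   # the equivalent anonymous-function version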

Here’s some code to create random IDs of this form, to test out the above:

nind <- 8                      # number of IDs to generate
lengths <- c(3, 3, 4)          # length of each hyphen-separated chunk
id <- NULL
for(i in seq(along=lengths)) {
  # random letters/digits for this chunk, for all individuals at once
  randchar <- sample(c(LETTERS, 0:9), nind*lengths[i], replace=TRUE)
  # paste each row of the matrix into a single string
  randstring <- apply(matrix(randchar, ncol=lengths[i]), 1, paste, collapse="")
  if(is.null(id)) id <- randstring
  else id <- paste(id, randstring, sep="-")
}

Electronic lab notebook

20 Aug 2013

I was interested to read C. Titus Brown’s recent post, “Is version control an electronic lab notebook?”

I think version control is really important, and I think all computational scientists should have something equivalent to a lab notebook. But I think of version control as serving needs orthogonal to those served by a lab notebook.

As Titus points out, a traditional lab notebook serves two purposes: provenance and protocol. Version control could be useful for provenance, but I don’t really care about provenance. And for protocol, version control doesn’t really matter.

Version control

I really like git with GitHub. (See my tutorial.) But for me, the basic need served by version control is that embodied in the question, “This shit worked before; why isn’t it working now?”

You don’t want to edit working code in place and risk breaking a working system. Version control lets you try things out, and it lets you do so starting from any version of your software, at any point in its history.

The other basic use of version control is for managing projects with multiple contributors. If there are multiple programmers working on a software project, or multiple authors working on a manuscript, version control is the best way to manage things, particularly for merging everyone’s efforts.

These are really useful things, but version control is more about merging and history and not so much about reproducible research.

Make is the thing

To me, the basic tool to make research reproducible is GNU make (see my minimal tutorial). You create a Makefile that documents all analysis steps in a project. (For example, “Use this script to turn these raw data files into that combined file, and use this script to create figure 1 and that script to create figure 2, then combine them with this LaTeX file to make the manuscript PDF.”)

With GNU make (see also rake), you both document and automate these processes. With well-documented/commented scripts and an all-encompassing Makefile, the research is reproducible.

Add knitr, and you’ve got a notebook

The other ingredient to create the computational scientist’s equivalent of a lab notebook is knitr, which allows one to combine text (e.g., in Markdown or asciidoc) and code (e.g., in R) to make documents (e.g., in html or PDF) that both do the work and explain the work. Write such documents to describe what you did and what you learned, and you’ve got an electronic lab notebook.
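
A minimal sketch of that workflow, assuming a hypothetical file analysis.Rmd containing Markdown text plus R code chunks:

library(knitr)
knit2html("analysis.Rmd")   # runs the R chunks and produces analysis.html
# knit("analysis.Rnw") would give analysis.tex instead, for the LaTeX/PDF route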

You could even get rid of your Makefile by having an over-arching knitr-based document that does it all. But I still like make.

But it’s so much work!

Going into a file and deleting a data point is a lot easier than writing a script that does it (and also documents why). But I don’t think you should be going in and changing the data like that, even if it is being tracked by version control. (And that is the main complaint potential users have about version control: “Too time consuming!”)

I think you have to expect that writing well-documented scripts and knitr-based reports that capture the totality of a data analysis project will take a lot of work: perhaps double (or more!) the effort. But it will save a ton of time later (if others care about what you did).

I don’t really want to take this time in the midst of a bout of exploratory data analysis. I find it too inhibiting. So I tend to do a bunch of analyses, capturing the main ideas in a draft R script (or reconstructing them later from the .Rhistory file), and then go back later to make a clean knitr-based document that explains what I was doing and why.

It can be hard to force myself to do the clean-up. I wish there were an easier way. But I expect that well-organized lab scientists devote a lot of time to constructing their lab notebooks, too.

Read the source code

6 Aug 2013

The other day, there was a bit of a twitter conversation about qqline in R.

It made me think: how exactly is the line produced by qqline chosen? I seemed to recall that the line was through the first and third quartiles.

An advantage of R is that you can just type the name of the function and see the code:

# qqline
function (y, datax = FALSE, distribution = qnorm, probs = c(0.25,
    0.75), qtype = 7, ...)
{
    stopifnot(length(probs) == 2, is.function(distribution))
    y <- quantile(y, probs, names = FALSE, type = qtype, na.rm = TRUE)
    x <- distribution(probs)
    if (datax) {
        slope <- diff(x)/diff(y)
        int <- x[1L] - slope * y[1L]
    }
    else {
        slope <- diff(y)/diff(x)
        int <- y[1L] - slope * x[1L]
    }
    abline(int, slope, ...)
}

I was right: They take the 25th and 75th percentiles of the data and of the theoretical distribution, calculate the slope and y-intercept of the line that goes through those two points, and use abline to draw the line.
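
If you want to convince yourself, you can re-derive the line by hand and overlay it on the plot. (A quick sketch; the example data here are arbitrary, and the defaults match qqline’s: quantile type 7 and the normal distribution.)

set.seed(1)
y <- rexp(100)                         # deliberately non-normal data
qy <- quantile(y, c(0.25, 0.75), names=FALSE)   # quartiles of the data
qx <- qnorm(c(0.25, 0.75))                      # quartiles of the theoretical distribution
slope <- diff(qy)/diff(qx)
int <- qy[1] - slope*qx[1]

qqnorm(y)
qqline(y)                              # the line from qqline()
abline(int, slope, col="red", lty=2)   # the same line, computed by hand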

Open source means the source is open, so why not take the time to look at it?

Sometimes typing the name of the function doesn’t tell you much:

# qqnorm
function (y, ...)
UseMethod("qqnorm")

In such cases, you could try typing, for example, qqnorm.default.
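
If typing the name doesn’t work (say, the method isn’t exported from its package’s namespace), a couple of standard helpers will still show you the code:

methods("qqnorm")               # list the available methods
getAnywhere("qqnorm.default")   # print the code, exported or not
stats:::qqnorm.default          # or go straight into the namespace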

Still, the comments (if there were any) get stripped off, and for long functions, it’s not pretty. So I like to keep a copy of the source code (for example, R-3.0.1.tar.gz; extract it with tar xzf R-3.0.1.tar.gz). I use grep to find the relevant files.

For example,

grep -r 'qqline' R-3.0.1/src/

shows that I should look for qqline in

R-3.0.1/src/library/stats/R/qqnorm.R

For something like cor, you might want to do:

grep -r 'cor <-' R-3.0.1/src

Or maybe:

grep -r 'cor <-' R-3.0.1/src/library/stats/R

But for cor, you probably also want to look at the C code, which is in

R-3.0.1/src/library/stats/src/cov.c

You can learn a lot about programming from the source code for R.