Row names in data frames: beware of 1:nrow

I spent some time puzzling over row names in data frames in R this morning. It seems that if you set the row names of a data frame, x, to 1:nrow(x), R will act as if you’d not assigned row names at all, and the names might get changed when you use rbind.

Here’s an illustration:

> x <- data.frame(id=1:3)
> y <- data.frame(id=4:6)
> rownames(x) <- 1:3
> rownames(y) <- LETTERS[4:6]
> rbind(x,y)
  id
1  1
2  2
3  3
D  4
E  5
F  6
> rbind(y,x)
  id
D  4
E  5
F  6
4  1
5  2
6  3


As you can see, if you give x the row names 1:3, these are treated as generic row numbers and could get changed following rbind if they end up in different rows. This doesn’t happen if x and y are matrices.
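
For comparison, here is a small sketch of the matrix case. The names mx and my are made up for this example; they hold the same ids in one-column matrices, where row names are kept as character dimnames.

```r
# Same ids as in the data frame example, stored as one-column matrices.
mx <- matrix(1:3, ncol = 1, dimnames = list(NULL, "id"))
my <- matrix(4:6, ncol = 1, dimnames = list(NULL, "id"))

# For a matrix, integer row names are coerced to character dimnames.
rownames(mx) <- 1:3
rownames(my) <- LETTERS[4:6]

# Stack them with my first, the order that renumbered the data frame rows.
m <- rbind(my, mx)
print(rownames(m))
```

The row names "1", "2", "3" should come through unchanged here, since matrices have no notion of automatic row names.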

I often use row names as identifiers, so it seems I must be cautious to use something other than row numbers.

10 Responses to “Row names in data frames: beware of 1:nrow”

  1. Rainer Says:

    rownames(x) <- as.character(1:3)

    helps, but of course, you need to be aware of it…
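
A minimal sketch of this suggestion, reusing x and y from the post (z is a name introduced here just to hold the result):

```r
x <- data.frame(id = 1:3)
y <- data.frame(id = 4:6)

# Character row names are stored as given, not treated as row numbers.
rownames(x) <- as.character(1:3)
rownames(y) <- LETTERS[4:6]

# Put y first, the order that caused the renumbering in the post.
z <- rbind(y, x)
print(z)
```

With character row names, the rows from x should keep the labels "1", "2", "3" wherever they end up.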

    • Karl Broman Says:

      Interesting, thanks, but that’s even more mysterious to me.

      I’ve been saving keystrokes, skipping the as.character(), thinking it made no difference.

      With either rownames(x) <- 1:3 or rownames(x) <- as.character(1:3), if you type rownames(x) the results look the same, and is.character(rownames(x)) returns TRUE. However attr(x, "row.names") is different for the two.

      And why does rownames(x) <- 1:3 get treated so much differently from rownames(x) <- 2:4?
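
The difference described above can be inspected directly. This is only a sketch; x1 and x2 are names made up for the two variants.

```r
x1 <- data.frame(id = 1:3)
rownames(x1) <- 1:3                 # integer row names

x2 <- data.frame(id = 1:3)
rownames(x2) <- as.character(1:3)   # character row names

# The accessor always returns character, so the two look identical...
same_names <- identical(rownames(x1), rownames(x2))

# ...but the underlying attribute has a different type.
type1 <- class(attr(x1, "row.names"))
type2 <- class(attr(x2, "row.names"))
print(c(same_names = same_names, x1 = type1, x2 = type2))
```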

  2. pvanb Says:

    Thanks, wasn’t aware of this. And it seems inconsistent that this only seems to happen when rownames start with 1.

  3. Kasper Daniel Hansen Says:

    In R, data frames are required to have row names. For large data frames there is some overhead (for example, space has to be allocated to store the character strings), so some versions of R ago it became possible to do something like
    rownames(df) <- NULL
    In this case, the data.frame _looks_ like it has rownames 1:nrow(df), but these names are not stored anywhere, instead they are kind of auto-generated (I am not too sure of the details).

    I think this is why you see that 1:nrow(df) (as integers) is a very special case. As soon as you use another sequence of integers, it is no longer the special case and it needs to be stored as characters.
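
One way to check whether a data frame's row names are the auto-generated kind is base R's .row_names_info(), which, as I read its help page, returns the number of rows with a negative sign when the row names are automatic. A small sketch (the variable names are made up for illustration):

```r
df <- data.frame(id = 4:6)

# Freshly created: row names are automatic, so the value is negative.
auto_before <- .row_names_info(df) < 0

# Explicit character names are stored, so the value is positive.
rownames(df) <- c("a", "b", "c")
stored <- .row_names_info(df) > 0

# Assigning NULL switches back to automatic row names.
rownames(df) <- NULL
auto_after <- .row_names_info(df) < 0

print(c(auto_before, stored, auto_after))
```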

    • Karl Broman Says:

      I suppose I should have actually read the documentation. The help file for data.frame is dreadfully dull, but includes:

      If row.names was supplied as NULL or no suitable component was found the row names are the integer sequence starting at one (and such row names are considered to be ‘automatic’, and not preserved by as.matrix).

      It also has the following useful Note:

      In versions of R prior to 2.4.0 row.names had to be character: to ensure compatibility with such versions of R, supply a character vector as the row.names argument.
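
The "not preserved by as.matrix" part of that help text can be seen in a short sketch (auto and named are names chosen for this example):

```r
auto  <- data.frame(id = 1:3)                               # automatic row names
named <- data.frame(id = 1:3, row.names = c("a", "b", "c")) # explicit row names

# Automatic row names are dropped on conversion; explicit ones are kept.
rn_auto  <- rownames(as.matrix(auto))
rn_named <- rownames(as.matrix(named))

print(rn_auto)
print(rn_named)
```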

  4. Gene Says:

    I find those quirks so frustrating! I suspect that someone would have had to have made ordered row names an exception to how row names are handled; which would mean that someone actually spent time and energy to create a bug.

    I can see where this would be useful: when merging yet unnamed data frames using cbind. Still, it’s such a nuisance. It has the potential to be a very insidious bug.

    I wonder though, do you think this type of problem is more or less prevalent in open source software? Does SAS have similar gotchas? I’m sure that they would be different, but I wonder if they are more or less plentiful in other languages.

    • Karl Broman Says:

      I expect that this sort of thing can happen in any but the simplest piece of software. And I suppose one may question whether it’s a bug or a feature, the main distinction being whether it was intended. It seems here that it was intended and so is a feature rather than a bug.

  5. giordano Says:

    I support Broman’s statement that it is more a feature than a bug. I use row names as an id generator and as an id check. There are two good reasons for doing this:
    1) it allows only unique values:
    (x <- data.frame(pseudoid = c(1:3,3:1)))
    rownames(x) <- x$pseudoid
    2) it gives a nonsense identifier (as you showed with your code). This makes sense, since building structure into identifiers only causes trouble.
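
The uniqueness check in point 1 can be sketched as follows. The tryCatch wrapper and the name res are additions for illustration, so that the failed assignment can be inspected instead of stopping the script.

```r
x <- data.frame(pseudoid = c(1:3, 3:1))

# Assigning duplicated values as row names is refused; capture the error.
res <- tryCatch({
  rownames(x) <- x$pseudoid
  NULL
}, error = function(e) e)

# The data frame is left with its original automatic row names.
print(inherits(res, "error"))
print(rownames(x))
```

R may also emit a warning listing the non-unique values before the error is raised.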

    Once I have built a data frame (using rownames(x) <- NULL) I start analysis and I can identify each row through rownames. Usually you don't add new records in an analysis. Look at the data frame as an entity. If you add a new record you get a new entity and this asks for a new definition of rownames.
    If you use subsets of the original data set this will (maybe) preserve the rownames:
    (x <- data.frame(pseudoid=c(LETTERS[4:6],c(4:6)),v=1:6))
    (x1 <- subset(x, v > 4))  # the comparison operator was lost in the original; ">" is assumed

    But since I’m not really sure about my last statement I usually put the identifier of my original data frame in a column:
    x$id <- rownames(x)
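
A rough sketch of this approach, with made-up data frames x and y: once the identifier lives in a column, it no longer matters what subset or rbind do to the row names.

```r
x <- data.frame(v = 1:3, row.names = c("a", "b", "c"))
y <- data.frame(v = 4:6, row.names = c("d", "e", "f"))

# Copy the row names into an ordinary column before combining.
x$id <- rownames(x)
y$id <- rownames(y)

# Subsetting keeps the original row names (and the id column).
sub <- subset(x, v > 1)

# After rbind, row names can be reset freely; the ids are still in the column.
z <- rbind(y, x)
rownames(z) <- NULL
print(z)
```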
