I spent some time puzzling over row names in data frames in R this morning. It seems that if you make the row names for a data frame, x, as 1:nrow(x), R will act as if you’d not assigned row names, and the names might get changed when you do rbind.
Here’s an illustration:
> x <- data.frame(id=1:3)
> y <- data.frame(id=4:6)
> rownames(x) <- 1:3
> rownames(y) <- LETTERS[4:6]
> rbind(x,y)
  id
1  1
2  2
3  3
D  4
E  5
F  6
> rbind(y,x)
  id
D  4
E  5
F  6
4  1
5  2
6  3
As you can see, if you give x the row names 1:3, these are treated as generic row numbers and could get changed following rbind if they end up in different rows. This doesn’t happen if x and y are matrices.
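For comparison, here is a minimal sketch of the matrix case. The names xm, ym and m are made up for this example; the point is only that matrix row names are ordinary character labels, so rbind carries them along unchanged.

```r
# Matrix analogues of x and y (xm, ym and m are illustrative names)
xm <- matrix(1:3, ncol = 1, dimnames = list(as.character(1:3), "id"))
ym <- matrix(4:6, ncol = 1, dimnames = list(LETTERS[4:6], "id"))

# Stack y on top of x; the row labels travel with their rows
m <- rbind(ym, xm)
rownames(m)
```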
I often use row names as identifiers, so it seems I must be cautious to use something other than row numbers.
22 Mar 2012 at 3:23 am |
rownames(x) <- as.character(1:3)
helps, but of course, you need to be aware of it…
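A small sketch of that suggestion, reusing x and y from the post (the name yx is only for illustration). With character row names, the labels should stay attached to their rows whichever way round the data frames are bound.

```r
x <- data.frame(id = 1:3)
y <- data.frame(id = 4:6)

# Store the row names of x explicitly as character strings
rownames(x) <- as.character(1:3)
rownames(y) <- LETTERS[4:6]

# Bind with y first; the labels "1", "2", "3" stay with the rows of x
yx <- rbind(y, x)
rownames(yx)
```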
22 Mar 2012 at 6:17 am |
Interesting, thanks, but that’s even more mysterious to me.
I’ve been saving keystrokes, skipping the as.character(), thinking it made no difference.
With either rownames(x) <- 1:3 or rownames(x) <- as.character(1:3), if you type rownames(x) the results look the same, and is.character(rownames(x)) returns TRUE. However, attr(x, "row.names") is different for the two.
And why does rownames(x) <- 1:3 get treated so much differently from rownames(x) <- 2:4?
22 Mar 2012 at 9:50 am |
Thanks, wasn’t aware of this. And it seems inconsistent that this only seems to happen when the row names start at 1.
22 Mar 2012 at 9:51 am |
In R, data frames are required to have row names. For large data frames this adds some overhead (for example, space has to be allocated to store the character strings), so some versions of R ago it became possible to do something like
rownames(df) <- NULL
In this case, the data.frame _looks_ like it has rownames 1:nrow(df), but these names are not stored anywhere, instead they are kind of auto-generated (I am not too sure of the details).
I think this is why you see that 1:nrow(df) (as integers) is a very special case. As soon as you use another sequence of integers, it is no longer the special case and the values have to be stored explicitly.
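One way to peek at what is stored is sketched below. The variable names are made up for the example, and .row_names_info() is a low-level helper in base R whose help page says it returns a negative count for "automatic" row names. What rownames(x) <- 1:3 itself leaves behind may depend on the R version, so that case is deliberately left out; treat this as something to try on your own installation rather than a definitive account.

```r
# Three ways of ending up with row names (all names here are illustrative)
x_auto <- data.frame(id = 1:3)
rownames(x_auto) <- NULL             # automatic row names

x_int <- data.frame(id = 1:3)
rownames(x_int) <- 2:4               # integers, but not 1:nrow(x)

x_chr <- data.frame(id = 1:3)
rownames(x_chr) <- as.character(1:3) # explicit character strings

# rownames() always reports character, so look at the attribute instead
typeof(attr(x_int, "row.names"))
typeof(attr(x_chr, "row.names"))

# Negative for automatic row names, per the help page for .row_names_info
.row_names_info(x_auto)
```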
22 Mar 2012 at 1:36 pm |
I suppose I should have actually read the documentation. The help file for data.frame is dreadfully dull, but includes:
It also has the following useful Note:
22 Mar 2012 at 10:19 am |
I find those quirks so frustrating! I suspect that someone had to make ordered row names an exception to how row names are handled, which would mean that someone actually spent time and energy to create a bug.
I can see where this would be useful: when merging as-yet-unnamed data frames using cbind. Still, it’s such a nuisance. It has the potential to be a very insidious bug.
I wonder though, do you think this type of problem is more or less prevalent in open source software? Does SAS have similar gotchas? I’m sure that they would be different, but I wonder if they are more or less plentiful in other languages.
22 Mar 2012 at 1:33 pm |
I expect that this sort of thing can happen in any but the simplest piece of software. And I suppose one may question whether it’s a bug or a feature, the main distinction being whether it was intended. It seems here that it was intended and so is a feature rather than a bug.
23 Mar 2012 at 2:50 am |
I support Broman’s statement that it is more a feature than a bug. I use rownames both as an ID generator and as an ID check. There are two good reasons for doing this:
1) it allows only unique values:
(x <- data.frame(pseudoid = c(1:3, 3:1)))
rownames(x) <- x$pseudoid  # fails: duplicate row names are not allowed
2) it gives meaningless identifiers (as you showed with your code). This makes sense, since identifiers that encode structure only cause trouble.
Once I have built a data frame (using rownames(x) <- NULL) I start analysis and I can identify each row through rownames. Usually you don't add new records in an analysis. Look at the data frame as an entity. If you add a new record you get a new entity and this asks for a new definition of rownames.
If you use subsets of the original data set this will (maybe) preserve the rownames:
(x <- data.frame(pseudoid=c(LETTERS[4:6],c(4:6)),v=1:6))
(x1 <- subset(x, v < 3 | v > 4))
But since I’m not really sure about my last statement I usually put the identifier of my original data frame in a column:
x$id <- rownames(x)
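Putting those pieces together, a minimal sketch (the variable kept is introduced only for this example): the subset keeps the row names of the rows it selects, and the id column still holds them even if the row names are reset later.

```r
x <- data.frame(pseudoid = c(LETTERS[4:6], 4:6), v = 1:6)

# Copy the row names into a regular column before doing anything else
x$id <- rownames(x)

# Keep rows with v below 3 or above 4
x1 <- subset(x, v < 3 | v > 4)
kept <- rownames(x1)    # row names carried over from x

# Resetting the row names does not touch the id column
rownames(x1) <- NULL
x1$id
```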
23 Mar 2012 at 2:52 am |
corrigenda: (x1 <- subset(x, v < 3 | v > 4))
23 Mar 2012 at 2:54 am |
corrigenda: (x1 <- subset(x, v lower 3 or v greater 4))
(Sorry about that. How is the syntax for greater, lower?)