It’s widely understood that, in R programming, one should avoid for loops and always try to use apply-type functions.
But this isn’t entirely true. It may have been true for Splus, back in the day: As I recall, that had to do with the entire environment from each iteration being retained in memory.
Here’s a simple example:
> x <- matrix(rnorm(4000*40000), ncol=4000)
> system.time({
+ mx <- rep(NA, nrow(x))
+ for(i in 1:nrow(x)) mx[i] <- max(x[i,])
+ })
   user  system elapsed
  3.719   0.446   4.164
> system.time(mx2 <- apply(x, 1, max))
   user  system elapsed
  5.548   1.783   7.333
There’s a great commentary on this point by Uwe Ligges and John Fox in the May 2008 issue of R News (see the “R help desk”, starting on page 46, and note that R News is now the R Journal).
Also see the related discussion at stackoverflow.
They say that apply can be more readable. It can certainly be more compact, but I usually find a for loop to be more readable, perhaps because I’m a C programmer first and an R programmer second.
A key point, from Ligges and Fox: “Initialize new objects to full length before the loop, rather than increasing their size within the loop.”
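To make that advice concrete, here’s a minimal sketch (the toy `i^2` computation is mine, not from the column); both loops compute the same vector, but the first grows it one element at a time while the second writes into a vector already allocated at full length:

```r
n <- 1e4

# growing the result one element at a time: each c() copies the whole vector
system.time({
  grown <- NULL
  for(i in 1:n) grown <- c(grown, i^2)
})

# pre-allocated to full length: each assignment writes in place
system.time({
  filled <- numeric(n)
  for(i in 1:n) filled[i] <- i^2
})
```

The gap widens with n, since the growing version does a quadratic amount of copying.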
3 Apr 2013 at 8:45 am
— I usually find a for loop to be more readable, perhaps because I’m a C programmer first and an R programmer second.
In a nutshell. If one were a relational database zealot, or a Prolog coder, by default, then the “set”-based approach is far more natural. And given that user-typed R code is nearly (always?) slower than internal R C code, use of set-implemented syntax is to be preferred. Better yet, if your data is in a SQL database (and you’ve wisely not eaten of Eve’s NoSQL apple, have you?), do all that munging over there first. It will save a ton of effort and headache.
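A sketch of the set-based point in R itself (my illustration, not the commenter’s; `do.call` over `pmax` is one of several vectorised routes, `matrixStats::rowMaxs` being another) — the per-element work moves from interpreted R into internal C code:

```r
x <- matrix(rnorm(1000 * 40), ncol = 40)

# row maxima via an explicit loop
mx_loop <- rep(NA_real_, nrow(x))
for(i in 1:nrow(x)) mx_loop[i] <- max(x[i, ])

# row maxima via pmax across the columns: one internal C call per column,
# element-wise max accumulated over all 40 columns
mx_set <- do.call(pmax, as.data.frame(x))

stopifnot(isTRUE(all.equal(mx_loop, unname(mx_set))))
```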
3 Apr 2013 at 2:37 pm
I guess you forgot the [i] in the mx assignment.
I mean:
> system.time({
+ mx <- rep(NA, nrow(x))
+ for(i in 1:nrow(x)) mx[i] <- max(x[i,])
+ })
3 Apr 2013 at 3:12 pm
Oops; thanks! I updated the post….
3 Apr 2013 at 11:02 pm
I also found that a loop is faster than *apply in some cases, especially if the object is very big (or the returned object is big).
For looping, we can compile the code to boost the speed. For *apply, we can make it parallel to boost the speed.
To me, if the work is computation-intensive then I would prefer apply. If the data object is large or requires nested looping, then I would prefer looping.
Here is an example based on your code :
> library(compiler)
> library(parallel)
>
> n x
> f <- cmpfun(function(x){
+ mx <- rep(NA, nrow(x))
+ for(i in 1:nrow(x)) mx[i]
> system.time({
+ mx <- rep(NA, nrow(x))
+ for(i in 1:nrow(x)) mx[i]
> system.time(mx2
> system.time({ mx3
> cl=makeCluster(detectCores())
> clusterExport(cl,"x")
>
> system.time(mx2
>
3 Apr 2013 at 11:10 pm
One more point here,
lapply would be faster than apply:
system.time(mx2 <- apply(x, 1, max))
system.time(mx2 <- lapply(1:nrow(x), function(z) max(x[z,]) ))
sorry, the code seems to be truncated. Let me try it again:
library(compiler)   # for cmpfun
library(parallel)   # for makeCluster, clusterExport, parSapply

n <- 40000
x <- matrix(rnorm(400*n), ncol=400)

# byte-compiled version of the loop; return mx so f(x) gives the result
f <- cmpfun(function(x){
  mx <- rep(NA, nrow(x))
  for(i in 1:nrow(x)) mx[i] <- max(x[i,])
  mx
})

# plain for loop
system.time({
  mx <- rep(NA, nrow(x))
  for(i in 1:nrow(x)) mx[i] <- max(x[i,])
})

system.time(mx2 <- apply(x, 1, max))
system.time(mx2 <- lapply(1:nrow(x), function(z) max(x[z,]) ))

# compiled loop
system.time({ mx3 <- f(x) })

# parallel version over a cluster of local workers
cl <- makeCluster(detectCores())
clusterExport(cl, "x")
system.time(mx2 <- parSapply(cl, 1:nrow(x), function(z) max(x[z,]) ))
stopCluster(cl)
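One caveat on the lapply timing above (my addition, not the commenter’s): lapply returns a list rather than a vector, so a like-for-like comparison with apply should unlist the result, or use vapply for a type-checked vector:

```r
x <- matrix(rnorm(1000 * 40), ncol = 40)

mx_apply  <- apply(x, 1, max)

# lapply returns a list; unlist() flattens it to a numeric vector
mx_lapply <- unlist(lapply(1:nrow(x), function(z) max(x[z, ])))

# vapply gives the same vector with an explicit result-type check
mx_vapply <- vapply(1:nrow(x), function(z) max(x[z, ]), numeric(1))

stopifnot(identical(mx_apply, mx_lapply), identical(mx_apply, mx_vapply))
```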
5 Apr 2013 at 5:42 am
Reblogged this on Easy ML World.
16 Feb 2014 at 2:56 am
Maybe this comes from the R Inferno: “failing to vectorise” is chapter 3, he makes the point about initialising at full size in ch 2 (gluttons), and he pokes fun at “Speaking R with a strong C accent” (or was it the other way around?). Anyway, he’s a popular source. Could be where the conventional wisdom comes from.