It’s widely understood that, in R programming, one should avoid `for` loops and always try to use `apply`-type functions.

But this isn’t entirely true. It may have been true for Splus, back in the day: As I recall, that had to do with the entire environment from each iteration being retained in memory.

Here’s a simple example:

> x <- matrix(rnorm(4000*40000), ncol=4000)

> system.time({

+ mx <- rep(NA, nrow(x))

+ for(i in 1:nrow(x)) mx[i] <- max(x[i,])

+ })

user system elapsed

3.719 0.446 4.164

> system.time(mx2 <- apply(x, 1, max))

user system elapsed

5.548 1.783 7.333
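To rerun the comparison quickly and confirm that the two approaches agree, something like the following sketch should work. The smaller matrix dimensions, the seed, and the names `x_small`, `mx_loop`, and `mx_apply` are arbitrary choices for illustration and are not part of the timing above.

```r
# Smaller version of the comparison; sizes and seed are illustrative only.
set.seed(1)
x_small <- matrix(rnorm(400 * 4000), ncol = 400)

# Preallocate the result, then fill it row by row.
mx_loop <- numeric(nrow(x_small))
for (i in seq_len(nrow(x_small))) mx_loop[i] <- max(x_small[i, ])

# The same computation with apply().
mx_apply <- apply(x_small, 1, max)

# Both approaches should give exactly the same row maxima.
identical(mx_loop, mx_apply)
```

Wrapping either computation in `system.time()` gives the timing; the absolute numbers will depend on the machine and the R version.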

There’s a great commentary on this point by Uwe Ligges and John Fox in the May, 2008, issue of R News (see the “R help desk”, starting on page 46, and note that R News is now the R Journal).

Also see the related discussion on Stack Overflow.

They say that `apply` can be more readable. It can certainly be more compact, but I usually find a `for` loop to be more readable, perhaps because I’m a C programmer first and an R programmer second.

A key point, from Ligges and Fox: “Initialize new objects to full length before the loop, rather than increasing their size within the loop.”
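A minimal sketch of that advice, assuming a toy task (squaring the integers 1 to `n`) that is not from the article: the helper names `grow` and `prealloc` and the size `n` are made up for illustration.

```r
n <- 1e4  # illustrative size

# Growing the result inside the loop: each c() call copies the
# existing values into a new, longer vector.
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Initializing to full length before the loop, then filling in place.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

system.time(a <- grow(n))
system.time(b <- prealloc(n))

# The two versions compute the same values.
identical(a, b)
```

The gap between the two timings should widen as `n` increases, since the growing version copies an ever longer vector on every iteration.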

3 Apr 2013 at 8:45 am |

– I usually find a for loop to be more readable, perhaps because I’m a C programmer first and an R programmer second.

In a nutshell. If one were, by default, a relational database zealot or a Prolog coder, then the “set”-based approach would be far more natural. And given that user-typed R code is nearly (always?) slower than R’s internal C code, set-based syntax is to be preferred. Better yet, if your data is in a SQL database (and you’ve wisely not eaten of Eve’s NoSQL apple, have you?), do all that munging over there first. It will save a ton of effort and headache.
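The point about internal C code can be illustrated with a built-in vectorized function. This sketch swaps row maxima for row sums, simply because base R provides `rowSums()`; that substitution, the matrix size, and the variable names are assumptions made for illustration.

```r
set.seed(1)
x_small <- matrix(rnorm(400 * 4000), ncol = 400)

# R-level iteration over rows via apply().
system.time(s_apply <- apply(x_small, 1, sum))

# Built-in vectorized equivalent, with the loop done in compiled code.
system.time(s_vec <- rowSums(x_small))

# The results agree up to floating-point tolerance.
all.equal(s_apply, s_vec)
```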

3 Apr 2013 at 2:37 pm |

I guess you forgot the [i] in the mx assignment.

I mean:

> system.time({

+ mx <- rep(NA, nrow(x))

+ for(i in 1:nrow(x)) mx[i] <- max(x[i,])

+ })

3 Apr 2013 at 3:12 pm |

Oops; thanks! I updated the post….

3 Apr 2013 at 11:02 pm |

I have also found that a loop is faster than *apply in some cases, especially if the object (or the returned object) is very big.

For loops, we can compile the code to speed it up. For *apply, we can run it in parallel to speed it up.

For me, if the work is computationally intensive, I prefer apply. If the data object is large or the task requires nested looping, I prefer a loop.

Here is an example based on your code:

[The code in this comment was truncated when posted; the commenter reposts it in full below.]

3 Apr 2013 at 11:10 pm |

One more point here,

lapply would be faster than apply

system.time(mx2 <- apply(x, 1, max))

system.time(mx2 <- lapply(1:nrow(x), function(z) max(x[z,]) ))

Sorry, the code seems to have been truncated. Let me try again:

library(compiler)

library(parallel)

n <- 40000

x <- matrix(rnorm(400*n), ncol=400)

f <- cmpfun(function(x){

mx <- rep(NA, nrow(x))

for(i in 1:nrow(x)) mx[i] <- max(x[i,])

mx  # return the filled vector; a for loop on its own returns NULL

})

system.time({

mx <- rep(NA, nrow(x))

for(i in 1:nrow(x)) mx[i] <- max(x[i,])

})

system.time(mx2 <- apply(x, 1, max))

system.time(mx2 <- lapply(1:nrow(x), function(z) max(x[z,]) ))

system.time({ mx3<-f(x)})

cl <- makeCluster(detectCores())

clusterExport(cl,"x")

system.time(mx2 <- parSapply(cl,1:nrow(x),function(z) max(x[z,]) ))

stopCluster(cl)  # shut down the worker processes when done
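One caveat on the comparison above: `lapply()` returns a list rather than a numeric vector, so its result is not the same object that `apply()` produces. A like-for-like variant, sketched here on a smaller matrix with made-up names (`x_small`, `row_max`), can use `vapply()`, which returns a vector and checks that each result is a single number.

```r
set.seed(1)
x_small <- matrix(rnorm(400 * 4000), ncol = 400)

# Helper that returns the maximum of row i (hypothetical name).
row_max <- function(i) max(x_small[i, ])

# lapply() gives a list with one element per row.
res_list <- lapply(seq_len(nrow(x_small)), row_max)

# vapply() gives a numeric vector and enforces the declared result type.
res_vec <- vapply(seq_len(nrow(x_small)), row_max, numeric(1))

is.list(res_list)
identical(unlist(res_list), res_vec)
identical(res_vec, apply(x_small, 1, max))
```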


16 Feb 2014 at 2:56 am |

Maybe this comes from the R Inferno — “failing to vectorise” is chapter 3, he makes the point about initialising at full size in ch 2 (gluttons), and he pokes fun at “Speaking R with a strong C accent” (or was it the other way around?) Anyway. He’s a popular source. Could be where the conventional wisdom comes from.