r - Grouping functions (tapply, by, aggregate) and the *apply family


Answer 1 (100 votes)

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

     # Two dimensional matrix
     M <- matrix(seq(1,16), 4, 4)

     # apply min to rows
     apply(M, 1, min)
     [1] 1 2 3 4

     # apply max to columns
     apply(M, 2, max)
     [1]  4  8 12 16

     # 3 dimensional array
     M <- array( seq(32), dim = c(4,4,2))

     # Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
     apply(M, 1, sum)
     # Result is one-dimensional
     [1] 120 128 136 144

     # Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
     apply(M, c(1,2), sum)
     # Result is two-dimensional
          [,1] [,2] [,3] [,4]
     [1,]   18   26   34   42
     [2,]   20   28   36   44
     [3,]   22   30   38   46
     [4,]   24   32   40   48

    If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
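As a small illustration of that advice, the sketch below checks that the specialised functions agree with the corresponding apply calls on the 4 x 4 matrix from above. The variable names are illustrative only.

```r
# Same 4 x 4 matrix as in the apply example
M <- matrix(seq(1, 16), 4, 4)

# Row sums: the general apply() call and the specialised rowSums()
row_sums_apply <- apply(M, 1, sum)
row_sums_fast  <- rowSums(M)

# Column means: the general apply() call and the specialised colMeans()
col_means_apply <- apply(M, 2, mean)
col_means_fast  <- colMeans(M)

# Both pairs hold the same values (all.equal ignores integer vs double storage)
isTRUE(all.equal(row_sums_apply, row_sums_fast))
isTRUE(all.equal(col_means_apply, col_means_fast))
```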

  • lapply - When you want to apply a function to each element of a list in turn and get a list back.

    This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

     x <- list(a = 1, b = 1:3, c = 10:100)

     lapply(x, FUN = length)
     $a
     [1] 1

     $b
     [1] 3

     $c
     [1] 91

     lapply(x, FUN = sum)
     $a
     [1] 1

     $b
     [1] 6

     $c
     [1] 5005

  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

    If you find yourself typing unlist(lapply(...)), stop and consider sapply.

     x <- list(a = 1, b = 1:3, c = 10:100)

     # Compare with above; a named vector, not a list
     sapply(x, FUN = length)
      a  b  c
      1  3 91

     sapply(x, FUN = sum)
        a    b    c
        1    6 5005

    In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

     sapply(1:5,function(x) rnorm(3,x)) 

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

     sapply(1:5,function(x) matrix(x,2,2)) 

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

     sapply(1:5,function(x) matrix(x,2,2), simplify = "array") 

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
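A minimal sketch of that contingency, with made-up variable names: when the returned vectors share a length, sapply builds a matrix, and when they do not, it quietly falls back to returning a list. The first two lines also check the unlist(lapply(...)) equivalence mentioned earlier.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# unlist(lapply(...)) and sapply(...) give the same named vector here
via_unlist <- unlist(lapply(x, length))
via_sapply <- sapply(x, length)
identical(via_unlist, via_sapply)

# Every call returns a vector of length 2, so the result is a 2 x 3 matrix
same_len <- sapply(1:3, function(i) rep(i, 2))
is.matrix(same_len)

# The returned lengths differ (1, 2, 3), so no simplification happens
# and the result is a plain list
ragged <- sapply(1:3, function(i) seq_len(i))
is.list(ragged)
```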

  • vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety.

    For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

     x <- list(a = 1, b = 1:3, c = 10:100)

     # Note that since the advantage here is mainly speed, this
     # example is only for illustration. We're telling R that
     # everything returned by length() should be an integer of
     # length 1.
     vapply(x, FUN = length, FUN.VALUE = 0L)
      a  b  c
      1  3 91

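The type-safety side of vapply can be sketched as follows. This is an illustration with made-up variable names: a mismatch between FUN.VALUE and what the function returns is an error rather than a silently different result, and an empty input still yields a vector of the declared type.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Declared and actual result agree: one integer per element
lens <- vapply(x, FUN = length, FUN.VALUE = integer(1))

# range() returns two values, so declaring a length-one result is a
# mismatch; vapply signals an error, which is caught here for display
mismatch <- tryCatch(
  vapply(x, FUN = range, FUN.VALUE = numeric(1)),
  error = function(e) "vapply signalled an error"
)

# With empty input, sapply returns an empty list, whereas vapply
# returns an empty vector of the declared type
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, integer(1))
```

The predictable return type on empty input is often the more practical reason to prefer vapply inside functions.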
  • mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

    This is multivariate in the sense that your function must accept multiple arguments.

     # Sums the 1st elements, the 2nd elements, etc.
     mapply(sum, 1:5, 1:5, 1:5)
     [1]  3  6  9 12 15

     # To do rep(1,4), rep(2,3), etc.
     mapply(rep, 1:4, 4:1)
     [[1]]
     [1] 1 1 1 1

     [[2]]
     [1] 2 2 2

     [[3]]
     [1] 3 3

     [[4]]
     [1] 4

  • Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

     Map(sum, 1:5, 1:5, 1:5)
     [[1]]
     [1] 3

     [[2]]
     [1] 6

     [[3]]
     [1] 9

     [[4]]
     [1] 12

     [[5]]
     [1] 15

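The relationship between Map and mapply can be checked directly. A small sketch, with illustrative variable names:

```r
# Map() and mapply(..., SIMPLIFY = FALSE) produce the same list
via_map    <- Map(sum, 1:5, 1:5, 1:5)
via_mapply <- mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE)
identical(via_map, via_mapply)

# With the default SIMPLIFY = TRUE, mapply collapses the same results
# into an atomic vector instead
simplified <- mapply(sum, 1:5, 1:5, 1:5)
is.list(via_map)        # a list
is.atomic(simplified)   # an atomic vector
```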
  • rapply - For when you want to apply a function to each element of a nested list structure, recursively.

    To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

     # Append ! to string, otherwise increment
     myFun <- function(x){
         if(is.character(x)){
           return(paste(x,"!",sep=""))
         }
         else{
           return(x + 1)
         }
     }

     # A nested list structure
     l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
               b = 3, c = "Yikes",
               d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

     # Result is named vector, coerced to character
     rapply(l, myFun)

     # Result is a nested list like l, with values altered
     rapply(l, myFun, how="replace")

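rapply also accepts a classes argument, which avoids the type test inside the function. The sketch below reuses the nested list from above; the choice of toupper and nchar is only for illustration.

```r
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Apply toupper() to character leaves only; with how = "replace" the
# other leaves are left as they are and the nesting is preserved
upper <- rapply(l, toupper, classes = "character", how = "replace")
upper$a$a1
upper$d$b2$b3

# With the default how = "unlist", leaves of other classes are dropped
# and the result is a flat named vector of the character lengths
n_chars <- rapply(l, nchar, classes = "character")
n_chars
```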
  • tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

    The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

    A vector:

     x <- 1:20 

    A factor (of the same length!) defining groups:

     y <- factor(rep(letters[1:5], each = 4)) 

    Add up the values in x within each subgroup defined by y:

     tapply(x, y, sum)
      a  b  c  d  e
     10 26 42 58 74

    More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.), hence its black sheep status.
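A sketch of the several-factors case, using the same x and y as above plus a second, made-up factor z that marks odd and even values of x:

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
z <- factor(rep(c("odd", "even"), times = 10))  # hypothetical second factor

# One cell per combination of y and z; the result is a 5 x 2 matrix
sums <- tapply(x, list(y, z), sum)
sums

# Group a holds 1, 2, 3, 4: the odd values sum to 4, the even ones to 6
sums["a", "odd"]
sums["a", "even"]
```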

Answer 2 (90 votes)

As a side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document on the plyr webpage http://had.co.nz/plyr/):

Base function   Input   Output   plyr function
-------------   -----   ------   ---------------
aggregate       d       d        ddply + colwise
apply           a       a/l      aaply / alply
by              d       l        dlply
lapply          l       l        llply
mapply          a       a/l      maply / mlply
replicate       r       a/l      raply / rlply
sapply          l       a        laply

One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.

Conceptually, learning plyr is no more difficult than understanding the base *apply functions.

plyr and reshape functions have replaced almost all of these functions in my everyday use. But, also from the Intro to Plyr document:

Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.

Answer 3 (74 votes)

From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:

[Image: apply, sapply, lapply, by, aggregate]

(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)

(on the left is input, on the top is output)

Answer 4 (63 votes)

First start with Joran's excellent answer -- doubtful anything can better that.

Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.

Mnemonics

  • lapply is a list apply which acts on a list or vector and returns a list.
  • sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
  • vapply is a verified apply (allows the return object type to be prespecified)
  • rapply is a recursive apply for nested lists, i.e. lists within lists
  • tapply is a tagged apply where the tags identify the subsets
  • apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)

Building the Right Background

If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.

These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.

Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.

Answer 5 (52 votes)

I noticed that the (excellent) answers to this post lack explanations of by and aggregate, so here is my contribution.

BY

The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:

    ct <- tapply(iris$Sepal.Width , iris$Species , summary )
    cb <- by(iris$Sepal.Width , iris$Species , summary )

    cb
    iris$Species: setosa
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.300   3.200   3.400   3.428   3.675   4.400
    --------------------------------------------------------------
    iris$Species: versicolor
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.000   2.525   2.800   2.770   3.000   3.400
    --------------------------------------------------------------
    iris$Species: virginica
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.200   2.800   3.000   2.974   3.175   3.800

    ct
    $setosa
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.300   3.200   3.400   3.428   3.675   4.400

    $versicolor
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.000   2.525   2.800   2.770   3.000   3.400

    $virginica
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.200   2.800   3.000   2.974   3.175   3.800

If we print these two objects, ct and cb, we have "essentially" the same results; the only differences are in how they are displayed and in their class attributes: by for cb and array for ct.
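The class difference can be inspected directly. A short sketch using the same two calls:

```r
ct <- tapply(iris$Sepal.Width, iris$Species, summary)
cb <- by(iris$Sepal.Width, iris$Species, summary)

class(ct)   # the tapply result is an array (holding one summary per species)
class(cb)   # the by result carries the class "by"

# Individual groups can be pulled out of the tapply result by name
setosa_summary <- ct[["setosa"]]
setosa_summary[["Mean"]]
```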

As I've said, the power of by arises when we can't use tapply; the following code is one example:

    tapply(iris, iris$Species, summary )
    Error in tapply(iris, iris$Species, summary) :
      arguments must have same length

R says that the arguments must have the same length. What we asked for is "the summary of every variable in iris for each level of the factor Species", but tapply cannot do that because it does not know how to handle a data frame as its first argument.

With the by function, R dispatches a specific method for the data frame class and lets the summary function work even though the length (and the type) of the first argument differ from what tapply expects.

    bywork <- by(iris, iris$Species, summary )

    bywork
    iris$Species: setosa
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
     Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50
     1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0
     Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0
     Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
     3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
     Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600
    --------------------------------------------------------------
    iris$Species: versicolor
      Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species
     Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0
     1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50
     Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0
     Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326
     3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500
     Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800
    --------------------------------------------------------------
    iris$Species: virginica
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
     Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0
     1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0
     Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50
     Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026
     3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300
     Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

It does work, and the result may be surprising: it is an object of class by that, for each level of Species, holds the summary of every variable.

Note that if the first argument is a data frame, the function being applied must be able to handle objects of that class. For example, if we use this code with the mean function, we get a result that makes no sense at all:

    by(iris, iris$Species, mean)
    iris$Species: setosa
    [1] NA
    -------------------------------------------
    iris$Species: versicolor
    [1] NA
    -------------------------------------------
    iris$Species: virginica
    [1] NA
    Warning messages:
    1: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA
    2: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA
    3: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA
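If what is wanted is the per-species mean of each numeric column (an assumption about the intent, not something stated above), one possible workaround is to hand by a function that does accept a data frame, such as colMeans. A minimal sketch:

```r
# Keep only the four numeric columns; colMeans() cannot handle the factor
num_cols <- iris[, 1:4]

# colMeans() accepts a data frame, so by() can apply it to each subset
means_by <- by(num_cols, iris$Species, colMeans)

# Stack the per-species vectors into a 3 x 4 matrix for easier reading
means_mat <- do.call(rbind, means_by)
means_mat
```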

AGGREGATE

aggregate can be seen as a different way of using tapply, if we use it like this:

    at <- tapply(iris$Sepal.Length , iris$Species , mean)
    ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)

    at
        setosa versicolor  virginica
         5.006      5.936      6.588

    ag
         Group.1     x
    1     setosa 5.006
    2 versicolor 5.936
    3  virginica 6.588

The two immediate differences are that the second argument of aggregate must be a list, whereas for tapply it can be (but need not be) a list, and that the output of aggregate is a data frame, whereas that of tapply is an array.

The power of aggregate is that it can easily handle subsets of the data through the subset argument of its formula method, and that it also has methods for ts objects and formulas.
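A small sketch of the subset argument with the formula method, using the built-in ToothGrowth data. Restricting to dose == 2 is an arbitrary choice for illustration.

```r
# Mean tooth length per supplement type, using only the rows with dose 2
ag_sub <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean,
                    subset = dose == 2)
ag_sub

# The same numbers via tapply need the subsetting done by hand
keep   <- ToothGrowth$dose == 2
ta_sub <- tapply(ToothGrowth$len[keep], ToothGrowth$supp[keep], mean)
ta_sub
```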

These elements make aggregate easier to work with than tapply in some situations. Here are some examples (available in the documentation):

    ag <- aggregate(len ~ ., data = ToothGrowth, mean)

    ag
      supp dose   len
    1   OJ  0.5 13.23
    2   VC  0.5  7.98
    3   OJ  1.0 22.70
    4   VC  1.0 16.77
    5   OJ  2.0 26.06
    6   VC  2.0 26.14

We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:

    att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)

    att
           OJ    VC
    0.5 13.23  7.98
    1   22.70 16.77
    2   26.06 26.14

There are other times when we can't use by or tapply and we have to use aggregate.

    ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

    ag1
      Month    Ozone     Temp
    1     5 23.61538 66.73077
    2     6 29.44444 78.22222
    3     7 59.11538 83.88462
    4     8 59.96154 83.96154
    5     9 31.44828 76.89655

We cannot obtain the previous result with tapply in a single call; we have to calculate the mean by Month for each variable and then combine them. Note also that we have to pass na.rm = TRUE, because the formula method of aggregate uses na.action = na.omit by default. Even so, the Temp column below differs from the aggregate result: na.omit drops every row in which any variable is NA, so aggregate ignores the Temp values of rows where Ozone is missing, whereas na.rm = TRUE removes missing values from each variable separately:

    ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
    ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

    cbind(ta1, ta2)
           ta1      ta2
    5 23.61538 65.54839
    6 29.44444 79.10000
    7 59.11538 83.90323
    8 59.96154 83.96774
    9 31.44828 76.90000

With by, on the other hand, we cannot achieve that with mean: the following call fails to return the group means, most likely because of the supplied function, mean, rather than by itself:
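One way to see the role of na.action is to switch the row-wise removal off. The sketch below passes na.action = na.pass to the formula method and lets mean drop missing values itself through na.rm = TRUE; under those settings the aggregate result should line up with the two tapply calls.

```r
# Keep all rows (na.pass) and let mean() drop NAs per variable (na.rm)
ag_pass <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality,
                     FUN = mean, na.rm = TRUE, na.action = na.pass)

# Per-variable means by month, as computed with tapply
ta_ozone <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta_temp  <- tapply(airquality$Temp,  airquality$Month, mean, na.rm = TRUE)

# Both columns now agree with the tapply results
isTRUE(all.equal(ag_pass$Ozone, as.vector(ta_ozone)))
isTRUE(all.equal(ag_pass$Temp,  as.vector(ta_temp)))
```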

by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE) 

Other times the results are the same, and the differences are only in the class of the object (and therefore in how it is printed and, for example, how it is subset):

    byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
    aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

The previous two calls achieve the same goal and results. At some point, which tool to use is just a matter of personal taste and needs; note, though, that the two objects are subset in very different ways.
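To make the subsetting difference concrete, here is a hedged sketch of pulling the results for month 5 out of each object. It assumes the usual structure of these results: by returns a list-like object indexed by group label, while aggregate with a multi-valued function returns a data frame whose columns are matrices.

```r
byagg  <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

# by: one summary table per month, selected by the group label
may_by <- byagg[["5"]]

# aggregate: a data frame; each of Ozone and Temp is a matrix column with
# one row per month and one column per summary statistic
is.matrix(aggagg$Temp)
may_temp_mean <- aggagg$Temp[aggagg$Month == 5, "Mean"]
may_temp_mean
```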
