# r - Sort (order) data frame rows by multiple columns

ID : 813

viewed : 181

Tags : rsortingdataframer-faqr

### Top 5 Answer for r - Sort (order) data frame rows by multiple columns

98

You can use the `order()` function directly without resorting to add-on tools -- see this simpler answer which uses a trick right from the top of the `example(order)` code:

``R> dd[with(dd, order(-z, b)), ]     b x y z 4 Low C 9 2 2 Med D 3 1 1  Hi A 8 1 3  Hi A 9 1 ``

Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the `order()` function:

``R> dd[order(-dd[,4], dd[,1]), ]     b x y z 4 Low C 9 2 2 Med D 3 1 1  Hi A 8 1 3  Hi A 9 1 R>  ``

rather than using the name of the column (and `with()` for easier/more direct access).

90

• `order` from `base`
• `arrange` from `dplyr`
• `setorder` and `setorderv` from `data.table`
• `arrange` from `plyr`
• `sort` from `taRifx`
• `orderBy` from `doBy`
• `sortData` from `Deducer`

Most of the time you should use the `dplyr` or `data.table` solutions, unless having no-dependencies is important, in which case use `base::order`.

I recently added sort.data.frame to a CRAN package, making it class compatible as discussed here: Best way to create generic/method consistency for sort.data.frame?

Therefore, given the data.frame dd, you can sort as follows:

``dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),        levels = c("Low", "Med", "Hi"), ordered = TRUE),       x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),       z = c(1, 1, 1, 2)) library(taRifx) sort(dd, f= ~ -z + b ) ``

If you are one of the original authors of this function, please contact me. Discussion as to public domaininess is here: https://chat.stackoverflow.com/transcript/message/1094290#1094290

You can also use the `arrange()` function from `plyr` as Hadley pointed out in the above thread:

``library(plyr) arrange(dd,desc(z),b) ``

Benchmarks: Note that I loaded each package in a new R session since there were a lot of conflicts. In particular loading the doBy package causes `sort` to return "The following object(s) are masked from 'x (position 17)': b, x, y, z", and loading the Deducer package overwrites `sort.data.frame` from Kevin Wright or the taRifx package.

``#Load each time dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),        levels = c("Low", "Med", "Hi"), ordered = TRUE),       x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),       z = c(1, 1, 1, 2)) library(microbenchmark)  # Reload R between benchmarks microbenchmark(dd[with(dd, order(-z, b)), ] ,     dd[order(-dd\$z, dd\$b),],     times=1000 ) ``

Median times:

`dd[with(dd, order(-z, b)), ]` 778

`dd[order(-dd\$z, dd\$b),]` 788

``library(taRifx) microbenchmark(sort(dd, f= ~-z+b ),times=1000) ``

Median time: 1,567

``library(plyr) microbenchmark(arrange(dd,desc(z),b),times=1000) ``

Median time: 862

``library(doBy) microbenchmark(orderBy(~-z+b, data=dd),times=1000) ``

Median time: 1,694

Note that doBy takes a good bit of time to load the package.

``library(Deducer) microbenchmark(sortData(dd,c("z","b"),increasing= c(FALSE,TRUE)),times=1000) ``

Couldn't make Deducer load. Needs JGR console.

``esort <- function(x, sortvar, ...) { attach(x) x <- x[with(x,order(sortvar,...)),] return(x) detach(x) }  microbenchmark(esort(dd, -z, b),times=1000) ``

Doesn't appear to be compatible with microbenchmark due to the attach/detach.

``m <- microbenchmark(   arrange(dd,desc(z),b),   sort(dd, f= ~-z+b ),   dd[with(dd, order(-z, b)), ] ,   dd[order(-dd\$z, dd\$b),],   times=1000   )  uq <- function(x) { fivenum(x)[4]}   lq <- function(x) { fivenum(x)[2]}  y_min <- 0 # min(by(m\$time,m\$expr,lq)) y_max <- max(by(m\$time,m\$expr,uq)) * 1.05    p <- ggplot(m,aes(x=expr,y=time)) + coord_cartesian(ylim = c( y_min , y_max ))  p + stat_summary(fun.y=median,fun.ymin = lq, fun.ymax = uq, aes(fill=expr)) ``

(lines extend from lower quartile to upper quartile, dot is the median)

Given these results and weighing simplicity vs. speed, I'd have to give the nod to `arrange` in the `plyr` package. It has a simple syntax and yet is almost as speedy as the base R commands with their convoluted machinations. Typically brilliant Hadley Wickham work. My only gripe with it is that it breaks the standard R nomenclature where sorting objects get called by `sort(object)`, but I understand why Hadley did it that way due to issues discussed in the question linked above.

73

Dirk's answer is great. It also highlights a key difference in the syntax used for indexing `data.frame`s and `data.table`s:

``## The data.frame way dd[with(dd, order(-z, b)), ]  ## The data.table way: (7 fewer characters, but that's not the important bit) dd[order(-z, b)] ``

The difference between the two calls is small, but it can have important consequences. Especially if you write production code and/or are concerned with correctness in your research, it's best to avoid unnecessary repetition of variable names. `data.table` helps you do this.

Here's an example of how repetition of variable names might get you into trouble:

Let's change the context from Dirk's answer, and say this is part of a bigger project where there are a lot of object names and they are long and meaningful; instead of `dd` it's called `quarterlyreport`. It becomes :

``quarterlyreport[with(quarterlyreport,order(-z,b)),] ``

Ok, fine. Nothing wrong with that. Next your boss asks you to include last quarter's report in the report. You go through your code, adding an object `lastquarterlyreport` in various places and somehow (how on earth?) you end up with this :

``quarterlyreport[with(lastquarterlyreport,order(-z,b)),] ``

That isn't what you meant but you didn't spot it because you did it fast and it's nestled on a page of similar code. The code doesn't fall over (no warning and no error) because R thinks it is what you meant. You'd hope whoever reads your report spots it, but maybe they don't. If you work with programming languages a lot then this situation may be all to familiar. It was a "typo" you'll say. I'll fix the "typo" you'll say to your boss.

In `data.table` we're concerned about tiny details like this. So we've done something simple to avoid typing variable names twice. Something very simple. `i` is evaluated within the frame of `dd` already, automatically. You don't need `with()` at all.

``dd[with(dd, order(-z, b)), ] ``

it's just

``dd[order(-z, b)] ``

``quarterlyreport[with(lastquarterlyreport,order(-z,b)),] ``

it's just

``quarterlyreport[order(-z,b)] ``

It's a very small difference, but it might just save your neck one day. When weighing up the different answers to this question, consider counting the repetitions of variable names as one of your criteria in deciding. Some answers have quite a few repeats, others have none.

67

There are a lot of excellent answers here, but dplyr gives the only syntax that I can quickly and easily remember (and so now use very often):

``library(dplyr) # sort mtcars by mpg, ascending... use desc(mpg) for descending arrange(mtcars, mpg) # sort mtcars first by mpg, then by cyl, then by wt) arrange(mtcars , mpg, cyl, wt) ``

For the OP's problem:

``arrange(dd, desc(z),  b)      b x y z 1 Low C 9 2 2 Med D 3 1 3  Hi A 8 1 4  Hi A 9 1 ``

52

The R package `data.table` provides both fast and memory efficient ordering of data.tables with a straightforward syntax (a part of which Matt has highlighted quite nicely in his answer). There has been quite a lot of improvements and also a new function `setorder()` since then. From `v1.9.5+`, `setorder()` also works with data.frames.

First, we'll create a dataset big enough and benchmark the different methods mentioned from other answers and then list the features of data.table.

### Data:

``require(plyr) require(doBy) require(data.table) require(dplyr) require(taRifx)  set.seed(45L) dat = data.frame(b = as.factor(sample(c("Hi", "Med", "Low"), 1e8, TRUE)),                  x = sample(c("A", "D", "C"), 1e8, TRUE),                  y = sample(100, 1e8, TRUE),                  z = sample(5, 1e8, TRUE),                   stringsAsFactors = FALSE) ``

### Benchmarks:

The timings reported are from running `system.time(...)` on these functions shown below. The timings are tabulated below (in the order of slowest to fastest).

``orderBy( ~ -z + b, data = dat)     ## doBy plyr::arrange(dat, desc(z), b)     ## plyr arrange(dat, desc(z), b)           ## dplyr sort(dat, f = ~ -z + b)            ## taRifx dat[with(dat, order(-z, b)), ]     ## base R  # convert to data.table, by reference setDT(dat)  dat[order(-z, b)]                  ## data.table, base R like syntax setorder(dat, -z, b)               ## data.table, using setorder()                                    ## setorder() now also works with data.frames   # R-session memory usage (BEFORE) = ~2GB (size of 'dat') # ------------------------------------------------------------ # Package      function    Time (s)  Peak memory   Memory used # ------------------------------------------------------------ # doBy          orderBy      409.7        6.7 GB        4.7 GB # taRifx           sort      400.8        6.7 GB        4.7 GB # plyr          arrange      318.8        5.6 GB        3.6 GB  # base R          order      299.0        5.6 GB        3.6 GB # dplyr         arrange       62.7        4.2 GB        2.2 GB # ------------------------------------------------------------ # data.table      order        6.2        4.2 GB        2.2 GB # data.table   setorder        4.5        2.4 GB        0.4 GB # ------------------------------------------------------------ ``
• `data.table`'s `DT[order(...)]` syntax was ~10x faster than the fastest of other methods (`dplyr`), while consuming the same amount of memory as `dplyr`.

• `data.table`'s `setorder()` was ~14x faster than the fastest of other methods (`dplyr`), while taking just 0.4GB extra memory. `dat` is now in the order we require (as it is updated by reference).

### data.table features:

Speed:

• data.table's ordering is extremely fast because it implements radix ordering.

• The syntax `DT[order(...)]` is optimised internally to use data.table's fast ordering as well. You can keep using the familiar base R syntax but speed up the process (and use less memory).

Memory:

• Most of the times, we don't require the original data.frame or data.table after reordering. That is, we usually assign the result back to the same object, for example:

``DF <- DF[order(...)] ``

The issue is that this requires at least twice (2x) the memory of the original object. To be memory efficient, data.table therefore also provides a function `setorder()`.

`setorder()` reorders data.tables `by reference` (in-place), without making any additional copies. It only uses extra memory equal to the size of one column.

Other features:

1. It supports `integer`, `logical`, `numeric`, `character` and even `bit64::integer64` types.

Note that `factor`, `Date`, `POSIXct` etc.. classes are all `integer`/`numeric` types underneath with additional attributes and are therefore supported as well.

2. In base R, we can not use `-` on a character vector to sort by that column in decreasing order. Instead we have to use `-xtfrm(.)`.

However, in data.table, we can just do, for example, `dat[order(-x)]` or `setorder(dat, -x)`.