r - Quickly reading very large tables as dataframes

Tags : r, import, dataframe, r-faq





Top 5 Answers for r - Quickly reading very large tables as dataframes

Votes : 100

An update, several years later

This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

  1. Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.

  2. Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.

  3. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

  4. read.csv.raw from iotools provides another option for quickly reading CSV files.

  5. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

  6. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
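As a rough illustration of option 2, the sketch below reads a CSV with data.table::fread when that package is installed and falls back to read.csv otherwise. The helper name read_fast and the small temporary file are made up for this example; they are not part of any package.

```r
# Hypothetical helper: prefer data.table::fread, fall back to base R.
read_fast <- function(path) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    # data.table = FALSE asks fread to return a plain data.frame
    data.table::fread(path, data.table = FALSE)
  } else {
    read.csv(path, stringsAsFactors = FALSE)
  }
}

# Small made-up file, just to exercise the helper
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = c(0.5, 1.5, 2.5, 3.5, 4.5)),
          csv_path, row.names = FALSE)

fast_df <- read_fast(csv_path)
str(fast_df)
```

On a file this small the two branches will not differ noticeably; the point is only the calling pattern.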


The original answer

There are a couple of simple things to try, whether you use read.table or scan.

  1. Set nrows=the number of records in your data (nmax in scan).

  2. Make sure that comment.char="" to turn off interpretation of comments.

  3. Explicitly define the classes of each column using colClasses in read.table.

  4. Setting multi.line=FALSE may also improve performance in scan.
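The four tips above can be sketched as follows. The file contents, column names and row count are invented for the example; only the read.table and scan arguments come from the list.

```r
# Made-up example data written to a temporary tab-delimited file
n_rows <- 1000
txt_path <- tempfile(fileext = ".txt")
write.table(
  data.frame(id = seq_len(n_rows),
             score = seq(0, 1, length.out = n_rows),
             label = rep(c("a", "b"), length.out = n_rows)),
  txt_path, sep = "\t", row.names = FALSE, quote = FALSE
)

# read.table with row count, comment handling and column classes declared
tuned <- read.table(
  txt_path, header = TRUE, sep = "\t", quote = "",
  nrows = n_rows,                                      # tip 1
  comment.char = "",                                   # tip 2
  colClasses = c("integer", "numeric", "character"),   # tip 3
  stringsAsFactors = FALSE
)

# scan() equivalent: nmax caps the number of records, multi.line = FALSE (tip 4)
scanned <- scan(
  txt_path,
  what = list(id = integer(), score = numeric(), label = character()),
  sep = "\t", skip = 1, nmax = n_rows, multi.line = FALSE, quiet = TRUE
)
```

scan returns a list of columns rather than a data frame, so it still needs converting if a data frame is what you want.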

If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

The other alternative is filtering your data before you read it into R.

Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS; next time you can retrieve it faster with readRDS.
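A minimal sketch of that read-once, cache-as-binary pattern, using a made-up data frame in place of the slow import:

```r
# Made-up data frame standing in for the result of a slow text import
parsed <- data.frame(id = 1:1000, x = sqrt(1:1000),
                     grp = rep(c("a", "b"), 500),
                     stringsAsFactors = FALSE)

rds_path <- tempfile(fileext = ".rds")
saveRDS(parsed, rds_path)        # one-off: cache the parsed result
restored <- readRDS(rds_path)    # later sessions: skip the text parsing

all.equal(parsed, restored)
```

In real use the .rds file would live somewhere permanent rather than in tempfile().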

Votes : 89

Here is an example that utilizes fread from data.table 1.8.7

The examples come from the help page for fread, with the timings from my Windows XP Core 2 Duo E8400.

    library(data.table)

    # Demo speedup
    n = 1e6
    DT = data.table(a = sample(1:1000, n, replace = TRUE),
                    b = sample(1:1000, n, replace = TRUE),
                    c = rnorm(n),
                    d = sample(c("foo", "bar", "baz", "qux", "quux"), n, replace = TRUE),
                    e = rnorm(n),
                    f = sample(1:1000, n, replace = TRUE))
    DT[2, b := NA_integer_]
    DT[4, c := NA_real_]
    DT[3, d := NA_character_]
    DT[5, d := ""]
    DT[2, e := +Inf]
    DT[3, e := -Inf]

standard read.table

    write.table(DT, "test.csv", sep = ",", row.names = FALSE, quote = FALSE)
    cat("File size (MB):", round(file.info("test.csv")$size / 1024^2), "\n")
    ## File size (MB): 51

    system.time(DF1 <- read.csv("test.csv", stringsAsFactors = FALSE))
    ##    user  system elapsed
    ##   24.71    0.15   25.42

    # second run will be faster
    system.time(DF1 <- read.csv("test.csv", stringsAsFactors = FALSE))
    ##    user  system elapsed
    ##   17.85    0.07   17.98

optimized read.table

    system.time(DF2 <- read.table("test.csv", header = TRUE, sep = ",", quote = "",
                                  stringsAsFactors = FALSE, comment.char = "", nrows = n,
                                  colClasses = c("integer", "integer", "numeric",
                                                 "character", "numeric", "integer")))
    ##    user  system elapsed
    ##   10.20    0.03   10.32

fread

    require(data.table)
    system.time(DT <- fread("test.csv"))
    ##    user  system elapsed
    ##    3.12    0.01    3.22

sqldf

    require(sqldf)

    system.time(SQLDF <- read.csv.sql("test.csv", dbname = NULL))
    ##    user  system elapsed
    ##   12.49    0.09   12.69

    # sqldf as on SO
    f <- file("test.csv")
    system.time(SQLf <- sqldf("select * from f", dbname = tempfile(),
                              file.format = list(header = TRUE, row.names = FALSE)))
    ##    user  system elapsed
    ##   10.21    0.47   10.73

ff / ffdf

    require(ff)

    system.time(FFDF <- read.csv.ffdf(file = "test.csv", nrows = n))
    ##    user  system elapsed
    ##   10.85    0.10   10.99

In summary:

    ##    user  system elapsed  Method
    ##   24.71    0.15   25.42  read.csv (first time)
    ##   17.85    0.07   17.98  read.csv (second time)
    ##   10.20    0.03   10.32  Optimized read.table
    ##    3.12    0.01    3.22  fread
    ##   12.49    0.09   12.69  sqldf
    ##   10.21    0.47   10.73  sqldf on SO
    ##   10.85    0.10   10.99  ffdf

Votes : 71

I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.

There's been a little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40 million rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.

Here's my test code:

Set up the test data:

    bigdf <- data.frame(dim = sample(letters, 4e7, replace = TRUE),
                        fact1 = rnorm(4e7),
                        fact2 = rnorm(4e7, 20, 50))
    write.csv(bigdf, 'bigdf.csv', quote = FALSE)

I restarted R before running the following import routine:

    library(sqldf)
    f <- file("bigdf.csv")
    system.time(bigdf <- sqldf("select * from f", dbname = tempfile(),
                               file.format = list(header = TRUE, row.names = FALSE)))

I let the following line run all night but it never completed:

    system.time(big.df <- read.csv('bigdf.csv'))

Votes : 65

Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:

    attr(df, "row.names") <- .set_row_names(length(df[[1]]))
    class(df) <- "data.frame"

This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.
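For a complete picture, here is the same conversion applied to a small made-up list (the variable name cols and its contents are invented for the example):

```r
# A named list of equal-length columns (made-up data)
cols <- list(id = 1:5,
             value = c(2.5, 3.5, 4.5, 5.5, 6.5),
             tag = letters[1:5])

# Add the row-name attribute and the class, as described above
attr(cols, "row.names") <- .set_row_names(length(cols[[1]]))
class(cols) <- "data.frame"

str(cols)
```

Nothing here checks that the columns have equal lengths, so that remains the caller's responsibility.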

[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
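A minimal sketch of the column-per-binary-file idea, assuming a single numeric column and a made-up temporary file. The type and element count are not stored in the file, so here the count is derived from the file size.

```r
# Made-up numeric column
x <- as.numeric(1:100000) / 7

bin_path <- tempfile(fileext = ".bin")
con <- file(bin_path, "wb")
writeBin(x, con)               # doubles are written as 8-byte values
close(con)

# Reading back needs the type and the element count
n_values <- file.info(bin_path)$size / 8
con <- file(bin_path, "rb")
x_back <- readBin(con, what = "double", n = n_values)
close(con)

identical(x, x_back)
```

Character columns need more care (for example, storing lengths or using fixed-width fields), which this sketch does not attempt.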

Votes : 59

This was previously asked on R-Help, so that's worth reviewing.

One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.
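A rough sketch of that readChar approach, using strsplit only and a tiny made-up tab-delimited file; the column names key and val are assumptions for the example:

```r
# Made-up tab-delimited file
raw_path <- tempfile(fileext = ".txt")
writeLines(c("key\tval", "k1\t3.9", "k2\t8.9", "k3\t1.2"), raw_path)

# Read the whole file as one string
n_bytes <- file.info(raw_path)$size
txt <- readChar(raw_path, nchars = n_bytes, useBytes = TRUE)

# Split into lines (tolerating \r\n line endings), then into fields
lines <- strsplit(txt, "\r?\n")[[1]]
fields <- strsplit(lines, "\t", fixed = TRUE)

header <- fields[[1]]
body <- do.call(rbind, fields[-1])
parsed <- data.frame(key = body[, 1],
                     val = as.numeric(body[, 2]),
                     stringsAsFactors = FALSE)
```

This assumes every line has the same number of fields and no quoting, which is exactly the checking that read.table spends time on.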

I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. Here is an example (though Hadoop itself has a learning curve):

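    library(HadoopStreaming)

    # The final record in the source string was cut off, so the string ends after the last complete row
    str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\n"
    cat(str)
    cols = list(key = '', val = 0)
    con <- textConnection(str, open = "r")
    hsTableReader(con, cols, chunkSize = 6, FUN = print, ignoreKey = TRUE)
    close(con)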

The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
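The chunking idea can also be tried without Hadoop. The sketch below is a plain base R approximation, not the HadoopStreaming method: it keeps one connection open and calls read.table repeatedly with nrows. The file, chunk size and column names are made up.

```r
# Made-up CSV with 10 rows
chunk_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, val = (1:10) * 1.5),
          chunk_path, row.names = FALSE)

chunk_size <- 4
con <- file(chunk_path, open = "r")

# Consume the header line once and strip the quotes write.csv adds
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

chunks <- list()
repeat {
  chunk <- tryCatch(
    read.table(con, sep = ",", nrows = chunk_size, col.names = col_names,
               colClasses = c("integer", "numeric")),
    error = function(e) NULL   # read.table errors once no lines remain
  )
  if (is.null(chunk)) break
  chunks[[length(chunks) + 1]] <- chunk
  if (nrow(chunk) < chunk_size) break   # short chunk means end of file
}
close(con)

result <- do.call(rbind, chunks)
```

Keeping every chunk, as done here, does nothing for memory; in practice each chunk would be filtered or aggregated before the next one is read.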
