Read and Write Huge Datasets
Parag Verma
Introduction
In this blog, my aim is to introduce a quick hack in R. We will look at how to read and write a huge dataset from/to a drive. The data.table library provides two very powerful functions, fread and fwrite, which import and export datasets with millions of records quite easily. We will work with a dataset that ships with an R package and record the time taken for each operation.
Installing the library: dplyr and tidyr
if (!require("dplyr")) {
  install.packages("dplyr")
  library(dplyr)  # load the package once it has been installed
}
if (!require("tidyr")) {
  install.packages("tidyr")
  library(tidyr)  # load the package once it has been installed
}
Importing the dataset
For this exercise we will look at the Birthdays dataset from the mosaicData package, which records US births from 1969 to 1988. It has 372,864 records and 7 columns.
# mosaicData library for importing the dataset
if (!require("mosaicData")) {
  install.packages("mosaicData")
  library(mosaicData)  # load the package once it has been installed
}
data(Birthdays)
df<-Birthdays
dim(df)
[1] 372864 7
Writing the dataset to drive
if (!require("data.table")) {
  install.packages("data.table")
  library(data.table)  # load the package once it has been installed
}
t1<-Sys.time()
data.table::fwrite(df, "dummy.csv", row.names = FALSE)
Sys.time()-t1
Time difference of 0.02991986 secs
It took about 0.03 seconds to write the dataset.
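As a rough point of comparison, the same kind of timing can be taken with system.time(), which reports elapsed seconds directly, and set against base R's write.csv(). This is only a minimal sketch: it assumes data.table is installed, and the synthetic data frame demo_df and the temporary file paths are illustrative names that are not part of the exercise above. Timings will differ from machine to machine.

```r
library(data.table)

# Synthetic stand-in for a large dataset (illustrative only)
set.seed(1)
n <- 100000
demo_df <- data.frame(
  id    = seq_len(n),
  group = sample(letters, n, replace = TRUE),
  value = rnorm(n),
  stringsAsFactors = FALSE
)

# Temporary files, so the working directory is left untouched
fwrite_path <- tempfile(fileext = ".csv")
base_path   <- tempfile(fileext = ".csv")

# system.time() returns user, system and elapsed times in seconds
fwrite_time <- system.time(data.table::fwrite(demo_df, fwrite_path))
base_time   <- system.time(write.csv(demo_df, base_path, row.names = FALSE))

# Elapsed wall-clock time for each writer
fwrite_time["elapsed"]
base_time["elapsed"]
```

Comparing the two elapsed values on your own data gives a more direct sense of how much fwrite helps than a single Sys.time() difference.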
Reading the dataset from drive
t1<-Sys.time()
data.table::fread("dummy.csv")
state year month day date wday births
1: AK 1969 1 1 1969-01-01T00:00:00Z Wed 14
2: AL 1969 1 1 1969-01-01T00:00:00Z Wed 174
3: AR 1969 1 1 1969-01-01T00:00:00Z Wed 78
4: AZ 1969 1 1 1969-01-01T00:00:00Z Wed 84
5: CA 1969 1 1 1969-01-01T00:00:00Z Wed 824
---
372860: VT 1988 12 31 1988-12-31T00:00:00Z Sat 21
372861: WA 1988 12 31 1988-12-31T00:00:00Z Sat 157
372862: WI 1988 12 31 1988-12-31T00:00:00Z Sat 167
372863: WV 1988 12 31 1988-12-31T00:00:00Z Sat 45
372864: WY 1988 12 31 1988-12-31T00:00:00Z Sat 18
Sys.time()-t1
Time difference of 0.1196802 secs
It took around 0.12 seconds to read the dataset.
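In practice the result of fread is usually assigned to a variable, which also keeps console printing out of the measured time. The sketch below assumes data.table is installed and uses a tiny hypothetical table (demo_df, built from the first three rows shown above) and a temporary file instead of dummy.csv. It reads the file back and checks that the round trip preserved the data.

```r
library(data.table)

# Small illustrative table written to a temporary file
demo_df <- data.frame(
  state  = c("AK", "AL", "AR"),
  year   = c(1969L, 1969L, 1969L),
  births = c(14L, 174L, 78L),
  stringsAsFactors = FALSE
)
demo_path <- tempfile(fileext = ".csv")
data.table::fwrite(demo_df, demo_path)

# Assigning the result means nothing is printed while the clock runs
read_time <- system.time(demo_dt <- data.table::fread(demo_path))
read_time["elapsed"]

# fread() returns a data.table; convert if a plain data.frame is needed
demo_back <- as.data.frame(demo_dt)

# Compare the data read back with the data that was written
identical(dim(demo_back), dim(demo_df))
all.equal(demo_back, demo_df)
```

A check like this is worth running once on a new dataset, since column types such as dates can change when they pass through a CSV file.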
Final Comments
We saw an example of how to read and write a file of decent size from/to the working directory using fread and fwrite.
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html