Reading Different Files in R
Parag Verma
Introduction
In this Blog we will look at how to read different types of Tabular files in R. Essentially we will look at how to read the follwing:
- csv
- xlsx
- tsv
- txt
- xls
- sas7bdat (sas files)
- sav(spss)
Funtions required to read each file type
Below is a list of various functions that will help us in reading the file
- read.csv (for csv)
- read.xlsx from openxlsx library(for xlsx)
- read.table(for tsv)
- read.table(for txt)
- read.xlsx2 from xlsx library(for xls)
- read_sas from haven library (for sas7bdat)
- read_sav from haven library(for spss)
Installing the library: dplyr,tidyr,stringr and other packages
if(!require("dplyr")){
install.packages("dplyr")
}else{
library(dplyr)
}
if(!require("tidyr")){
install.packages("tidyr")
}else{
library(tidyr)
}
# For detecting file pattern/type
if(!require("stringr")){
install.packages("stringr")
}else{
library(stringr)
}
# For reading xlsx files
if(!require("openxlsx")){
install.packages("openxlsx")
}else{
library(openxlsx)
}
# For reading xls files
if(!require("xlsx")){
install.packages("xlsx")
}else{
library(xlsx)
}
# For reading sas and spss files
if(!require("haven")){
install.packages("haven")
}else{
library(haven)
}
Listing different files in a folder
Lets look at the file within Parag folder in my working directory
lsfile<-list.files(paste0(getwd(),"//Parag"))
lsfile
[1] "file1.xls" "file2.csv" "file3.tsv" "file4.txt"
[5] "file5.xlsx" "file6.sas7bdat" "file7.sav"
We can see that there are 7 different types of files present in the Folder.Lets try and create a function that reads each one of them in a list
Creating the Function and Reading the files
read.batch<-function(x){
# Reading a csv file
if(grepl(".csv",x)){
df.batch<-read.csv(paste0(getwd(),"\\Parag\\",x),stringsAsFactors=F)
return(df.batch)
}else if(grepl(".xlsx\\b",x)){
# Reading an xlsx file
df.batch<-openxlsx::read.xlsx(paste0(getwd(),"\\","Parag\\",x),startRow = 1,colNames = T,rowNames = F)
return(df.batch)
}else if(grepl(".sas7bdat",x)){
# Reading a sas file
df.batch<-read_sas(paste0(getwd(),"\\","Parag\\",x))
return(df.batch)
}else if(grepl(paste0(c(".tsv",".txt"),collapse = "|"),x)){
# Reading a tsv or a text file file
df.batch<-read.table(paste0(getwd(),"\\","Parag\\",x),sep="\t",header=T,stringsAsFactors=F)
return(df.batch)
}else if(grepl(".xls\\b",x)){
# Reading a tsv or a text file file
df.batch<-xlsx::read.xlsx2(paste0(getwd(),"\\","Parag\\",x),header=T,stringsAsFactors=F,sheetIndex=1)
return(df.batch)
}else{
# Reading an SPSS file (.sav)
df.batch<-read_sav(paste0(getwd(),"\\","Parag\\",x))
return(df.batch)
}
}# end of read function
# Running the 'read.csv.batch' function with lapply
batch.files <- lapply(lsfile,read.batch )
# Assigning Names of files read
names(batch.files)<-lsfile
Getting number of columns and records of each file
for(i in names(batch.files)){
print(paste0("Number of rows in ",i," is ",dim(batch.files[[i]])[1]," and column is ",dim(batch.files[[i]])[2]))
}
[1] "Number of rows in file1.xls is 13 and column is 2"
[1] "Number of rows in file2.csv is 2 and column is 3"
[1] "Number of rows in file3.tsv is 41 and column is 3"
[1] "Number of rows in file4.txt is 0 and column is 1"
[1] "Number of rows in file5.xlsx is 62 and column is 17"
[1] "Number of rows in file6.sas7bdat is 150 and column is 5"
[1] "Number of rows in file7.sav is 150 and column is 5"
Final Comments
We can cleary see that the files have been read as we are able to get the number of rows and columns. The main trick here is to create a custom function to read different files and then use it along with lapply to vectorise operation
Link to Previous R Blogs
List of Datasets for Practise
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html
No comments:
Post a Comment