Sunday, March 29, 2020

Blog 17: Reading Different Types of Files

Reading Different Files in R


Introduction

In this Blog we will look at how to read different types of Tabular files in R. Essentially we will look at how to read the follwing:

  • csv
  • xlsx
  • tsv
  • txt
  • xls
  • sas7bdat (sas files)
  • sav(spss)


Funtions required to read each file type

Below is a list of various functions that will help us in reading the file

  • read.csv (for csv)
  • read.xlsx from openxlsx library(for xlsx)
  • read.table(for tsv)
  • read.table(for txt)
  • read.xlsx2 from xlsx library(for xls)
  • read_sas from haven library (for sas7bdat)
  • read_sav from haven library(for spss)


Installing the library: dplyr,tidyr,stringr and other packages

if(!require("dplyr")){
  
  install.packages("dplyr")
}else{
  
  library(dplyr)
}

if(!require("tidyr")){
  
  install.packages("tidyr")
}else{
  
  library(tidyr)
}

# For detecting file pattern/type
if(!require("stringr")){
  
  install.packages("stringr")
}else{
  
  library(stringr)
}

# For reading xlsx files
if(!require("openxlsx")){
  
  install.packages("openxlsx")
}else{
  
  library(openxlsx)
}

# For reading xls files
if(!require("xlsx")){
  
  install.packages("xlsx")
}else{
  
  library(xlsx)
}


# For reading sas and spss files
if(!require("haven")){
  
  install.packages("haven")
}else{
  
  library(haven)
}


Listing different files in a folder

Lets look at the file within Parag folder in my working directory

lsfile<-list.files(paste0(getwd(),"//Parag"))
lsfile
[1] "file1.xls"      "file2.csv"      "file3.tsv"      "file4.txt"     
[5] "file5.xlsx"     "file6.sas7bdat" "file7.sav"     

We can see that there are 7 different types of files present in the Folder.Lets try and create a function that reads each one of them in a list


Creating the Function and Reading the files

read.batch<-function(x){

  # Reading a csv file
  if(grepl(".csv",x)){

  df.batch<-read.csv(paste0(getwd(),"\\Parag\\",x),stringsAsFactors=F)
  return(df.batch)

  }else if(grepl(".xlsx\\b",x)){

     # Reading an xlsx file
  df.batch<-openxlsx::read.xlsx(paste0(getwd(),"\\","Parag\\",x),startRow = 1,colNames = T,rowNames = F)
  return(df.batch)

  }else if(grepl(".sas7bdat",x)){

    # Reading a sas file
  df.batch<-read_sas(paste0(getwd(),"\\","Parag\\",x))
  return(df.batch)

  }else if(grepl(paste0(c(".tsv",".txt"),collapse = "|"),x)){

    # Reading a tsv or a text file file
  df.batch<-read.table(paste0(getwd(),"\\","Parag\\",x),sep="\t",header=T,stringsAsFactors=F)
  return(df.batch)

  }else if(grepl(".xls\\b",x)){

    # Reading a tsv or a text file file
  df.batch<-xlsx::read.xlsx2(paste0(getwd(),"\\","Parag\\",x),header=T,stringsAsFactors=F,sheetIndex=1)
  return(df.batch)

  }else{

    # Reading an SPSS file (.sav)
  df.batch<-read_sav(paste0(getwd(),"\\","Parag\\",x))
  return(df.batch)

  }


}# end of read function

# Running the 'read.csv.batch' function with lapply
batch.files <- lapply(lsfile,read.batch )

# Assigning Names of files read
names(batch.files)<-lsfile


Getting number of columns and records of each file

for(i in names(batch.files)){
  
  
  print(paste0("Number of rows in ",i," is ",dim(batch.files[[i]])[1]," and column is ",dim(batch.files[[i]])[2]))
  
}
[1] "Number of rows in file1.xls is 13 and column is 2"
[1] "Number of rows in file2.csv is 2 and column is 3"
[1] "Number of rows in file3.tsv is 41 and column is 3"
[1] "Number of rows in file4.txt is 0 and column is 1"
[1] "Number of rows in file5.xlsx is 62 and column is 17"
[1] "Number of rows in file6.sas7bdat is 150 and column is 5"
[1] "Number of rows in file7.sav is 150 and column is 5"


Final Comments

We can cleary see that the files have been read as we are able to get the number of rows and columns. The main trick here is to create a custom function to read different files and then use it along with lapply to vectorise operation


No comments:

Post a Comment

Embed Shiny

Please wait...