Wednesday, December 4, 2019

Blog 1 Basics of R: Vectors, Matrices, Lists and Data Frames

Different Data Structures in R



Vectors: Initialization, Length and Indexing

Vectors are the most basic entities in R and the building blocks for storing information. A vector can hold numbers, text, logical values and so on, but all elements of a single vector share one type. Let's look at how to create vectors and what the different types are.

# Let us define vectors p,q and r
p <- c(1,2,5.3,6,-2,4) # numeric vector
q <- c("one","two","three") # character vector
r <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
# c() combines the values listed inside the parentheses into a single vector

# Let's check what is in vector 'p'
p
## [1]  1.0  2.0  5.3  6.0 -2.0  4.0
class(p) # class gives the nature of values stored in 'p' vector
## [1] "numeric"
# Let's check what is in vector 'q'
q
## [1] "one"   "two"   "three"
class(q)
## [1] "character"
# Let's check what is in vector 'r'
r
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
class(r)
## [1] "logical"
# Refer to elements of a vector using subscripts. 
# Accessing the Third element within p
p[3]
## [1] 5.3
# Accessing the third and fourth elements
p[c(3,4)]
## [1] 5.3 6.0
# Let us calculate the length of the vector p
length(p)
## [1] 6
# Accessing the 7th element of p returns NA because
# there is no element at the 7th position in p
p[7]
## [1] NA
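
Each vector above holds a single type. As a small sketch (the name mixed is made up for illustration), mixing types inside c() makes R coerce every value to the most flexible type, and negative or logical subscripts give two more ways to index:

```r
# Hypothetical vector, used only for illustration
mixed <- c(1, "two", TRUE)   # the number and the logical are coerced to character
class(mixed)                 # "character"

p <- c(1, 2, 5.3, 6, -2, 4)  # same numeric vector as above
p[-1]                        # a negative subscript drops that position
p[p > 4]                     # a logical condition keeps the matching elements
```

This is why a vector that should be numeric can silently turn into text when a single quoted value slips in.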



Matrices: Initialization, Dimensions and Indexing

All data elements in a matrix must be of the same type, and all columns must have the same length.

Syntax for creating a matrix
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows.
byrow=FALSE indicates that the matrix should be filled by columns (the default)
dimnames provides optional labels for the columns and rows.
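
To make the byrow argument concrete, here is a minimal sketch that fills the same six values both ways. The names by_col and by_row are chosen only for this illustration.

```r
# Fill the same six values by column (the default) and by row
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]   # first row when filled by columns: 1 3 5
by_row[1, ]   # first row when filled by rows:    1 2 3
dim(by_col)   # both matrices are 2 x 3
```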

# generates 5 x 4 numeric matrix 
1:20 # Creates a vector sequence from 1 through 20 with a step of 1
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
y<-matrix(1:20, nrow=5,ncol=4)
y
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
# Accessing an element in a matrix (row 1, column 2)
y[1,2]
## [1] 6
y[,] # Outputs the entire matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2") 
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
                   dimnames=list(rnames, cnames)) 
mymatrix
##    C1 C2
## R1  1 26
## R2 24 68
# mymatrix[2,3]
# This raises a "subscript out of bounds" error,
# because mymatrix has no third column

# Identify rows, columns or elements using subscripts. 
y[,4] # 4th column of matrix
## [1] 16 17 18 19 20
y[3,] # 3rd row of matrix 
## [1]  3  8 13 18
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3 
##      [,1] [,2] [,3]
## [1,]    2    7   12
## [2,]    3    8   13
## [3,]    4    9   14
class(y[3,]) # It results in a vector
## [1] "integer"
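
Selecting a single row or column drops the matrix dimensions, as the class() call above shows. If the result should stay a matrix, base R's drop = FALSE argument can be used. A minimal sketch, with row_vec and row_mat as illustrative names:

```r
y <- matrix(1:20, nrow = 5, ncol = 4)  # same matrix as above

row_vec <- y[3, ]                 # dimensions are dropped: plain integer vector
row_mat <- y[3, , drop = FALSE]   # drop = FALSE keeps a 1 x 4 matrix

is.matrix(row_vec)   # FALSE
dim(row_mat)         # 1 4
```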



Lists: Initialization, Length and Indexing

Lists are R objects that can contain elements of different data types, such as numbers, strings, vectors and even other lists. A list can also contain a matrix or a function as an element. Lists matter in practice because many data-heavy steps in R use them. A list is created with the list() function.

# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
                  list("green",12.3))

list_data
## [[1]]
## [1] "Jan" "Feb" "Mar"
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
## 
## [[3]]
## [[3]][[1]]
## [1] "green"
## 
## [[3]][[2]]
## [1] 12.3
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
names(list_data)
## [1] "1st Quarter"  "A_Matrix"     "A Inner list"
# Accessing the element in the List
# Elements of a List can be accessed by the index of the element in the List

list_data[1] # Accessing the first element in the list
## $`1st Quarter`
## [1] "Jan" "Feb" "Mar"
list_data[[1]][1] # Accessing the first value within the first element
## [1] "Jan"
list_data[[1]][2]
## [1] "Feb"
list_data[[1]][3]
## [1] "Mar"
# Difference between [] and [[]] referencing
list_data[2] # Gives a list
## $A_Matrix
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
list_data[[2]] # Gives a matrix
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
list_data[3] # Gives a list
## $`A Inner list`
## $`A Inner list`[[1]]
## [1] "green"
## 
## $`A Inner list`[[2]]
## [1] 12.3
list_data[[3]][1] # Gives a one-element list taken from the inner list
## [[1]]
## [1] "green"
list_data[[3]][[1]] # Gives a vector
## [1] "green"
list_data[[3]][[2]] # Gives a vector
## [1] 12.3
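
Since the list elements were given names, they can also be reached by name instead of by position. The sketch below rebuilds the same list so that it runs on its own:

```r
# Same list and names as above
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
                  list("green",12.3))
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

list_data$A_Matrix             # $ works directly for syntactic names
list_data[["1st Quarter"]]     # [[ ]] accepts any name as a string
list_data$`A Inner list`[[2]]  # names with spaces need backticks after $
class(list_data["A_Matrix"])   # single brackets still return a list
```
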
# Manipulating the Elements in a List
# We can add, update and delete list elements as shown below.
# Assigning to a new index adds an element at the end,
# assigning NULL removes an element, and any existing element can be updated.

# Add element at the end of the list.
list_data[4] <- "New element"
print(list_data[4])
## [[1]]
## [1] "New element"
# Remove the last element.
list_data[4] <- NULL

# Print the 4th Element.
print(list_data[4])
## $<NA>
## NULL
# Update the 3rd Element.
list_data[3] <- "updated element"
print(list_data[3])
## $`A Inner list`
## [1] "updated element"
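
Base R's append() function has an after argument, and NULL can be assigned at any index, so elements can also be inserted into or removed from the middle of a list. A minimal sketch with a made-up list called colours:

```r
# Hypothetical list, used only for illustration
colours <- list("red", "green", "blue")

# Insert a new element after position 1
colours <- append(colours, list("yellow"), after = 1)
length(colours)        # 4

# Assigning NULL removes the element at that position
colours[[2]] <- NULL
unlist(colours)        # "red" "green" "blue"
```
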
# Merging of the List
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")

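# Merge the two lists.
# c() concatenates them into one flat list of six elements.
merged.list <- c(list1,list2)
# list() nests them instead, giving a list of two lists.
# This second assignment overwrites the first, so the output below is nested.
merged.list <- list(list1,list2)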

# Print the merged list.
print(merged.list)
## [[1]]
## [[1]][[1]]
## [1] 1
## 
## [[1]][[2]]
## [1] 2
## 
## [[1]][[3]]
## [1] 3
## 
## 
## [[2]]
## [[2]][[1]]
## [1] "Sun"
## 
## [[2]][[2]]
## [1] "Mon"
## 
## [[2]][[3]]
## [1] "Tue"
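
The difference between combining with c() and with list() is easiest to see from the lengths. A short sketch, using flat and nested as illustrative names:

```r
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

flat   <- c(list1, list2)     # one list with six elements
nested <- list(list1, list2)  # a list holding the two original lists

length(flat)      # 6
length(nested)    # 2
flat[[4]]         # "Sun"
nested[[2]][[1]]  # "Sun"
```
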
# Converting a List into a vector using unlist function
# Create lists.
list1 <- list(1:5)
print(list1) # Prints the list
## [[1]]
## [1] 1 2 3 4 5
list1[[1]] # Outputs the vector
## [1] 1 2 3 4 5
# Convert the lists to vectors.
v1 <- unlist(list1)
print(v1)
## [1] 1 2 3 4 5
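
Because the result of unlist() is an ordinary vector, a list with mixed types is coerced to one common type, and names are carried over. A small sketch with made-up inputs:

```r
# Hypothetical mixed list, used only for illustration
mixed_list <- list(1, "a", TRUE)
v <- unlist(mixed_list)
v          # "1" "a" "TRUE"
class(v)   # "character"

# Names of list elements are expanded for each value
named <- unlist(list(a = 1:2, b = 3))
names(named)   # "a1" "a2" "b"
```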



Data Frame: Initialization, Dimensions and Indexing

A data frame is similar to a table with rows and columns. Each row contains one value from each column.

The Key Features of a data frame are:
1. The column names should be non-empty.
2. The row names should be unique.
3. The data stored in a data frame can be of numeric, factor or character type, among others such as dates.
4. Each column should contain the same number of data items.
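
The features listed above can be checked on a small example. The data frame people below is made up for illustration; the sketch shows one type per column and the error raised when column lengths cannot be reconciled.

```r
# Hypothetical data frame, used only for illustration
people <- data.frame(id = 1:3, name = c("a", "b", "c"),
                     stringsAsFactors = FALSE)

sapply(people, class)   # one type per column: "integer" and "character"
nrow(people)            # 3
ncol(people)            # 2

# Columns of lengths 3 and 2 cannot be combined, so data.frame() raises an error
res <- tryCatch(data.frame(x = 1:3, y = 1:2), error = function(e) "error")
res                     # "error"
```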

# Creating a Data Frame

emp.data <- data.frame(
  emp_id = 1:5, 
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25), 
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                         "2015-03-27")),
  stringsAsFactors = FALSE
)

# Print the data frame.         
print(emp.data) 
##   emp_id emp_name salary start_date
## 1      1     Rick 623.30 2012-01-01
## 2      2      Dan 515.20 2013-09-23
## 3      3 Michelle 611.00 2014-11-15
## 4      4     Ryan 729.00 2014-05-11
## 5      5     Gary 843.25 2015-03-27
class(emp.data)
## [1] "data.frame"
# Getting the number of rows and columns: the dimensions of a data frame
dim(emp.data)
## [1] 5 4
# Getting the names of the columns
colnames(emp.data)
## [1] "emp_id"     "emp_name"   "salary"     "start_date"
# Getting the first few records
head(emp.data)
##   emp_id emp_name salary start_date
## 1      1     Rick 623.30 2012-01-01
## 2      2      Dan 515.20 2013-09-23
## 3      3 Michelle 611.00 2014-11-15
## 4      4     Ryan 729.00 2014-05-11
## 5      5     Gary 843.25 2015-03-27
head(emp.data,3)
##   emp_id emp_name salary start_date
## 1      1     Rick  623.3 2012-01-01
## 2      2      Dan  515.2 2013-09-23
## 3      3 Michelle  611.0 2014-11-15
# Getting the last few records
tail(emp.data)
##   emp_id emp_name salary start_date
## 1      1     Rick 623.30 2012-01-01
## 2      2      Dan 515.20 2013-09-23
## 3      3 Michelle 611.00 2014-11-15
## 4      4     Ryan 729.00 2014-05-11
## 5      5     Gary 843.25 2015-03-27
tail(emp.data,3)
##   emp_id emp_name salary start_date
## 3      3 Michelle 611.00 2014-11-15
## 4      4     Ryan 729.00 2014-05-11
## 5      5     Gary 843.25 2015-03-27
# Get the Structure of the Data frame
# str is used to get the data types and first few values of the columns used
str(emp.data)
## 'data.frame':    5 obs. of  4 variables:
##  $ emp_id    : int  1 2 3 4 5
##  $ emp_name  : chr  "Rick" "Dan" "Michelle" "Ryan" ...
##  $ salary    : num  623 515 611 729 843
##  $ start_date: Date, format: "2012-01-01" "2013-09-23" ...
class(dim(emp.data))
## [1] "integer"
k<-dim(emp.data)
k[1] # Get the number of rows
## [1] 5
k[2] # Get the number of columns
## [1] 4
# Statistical Summary can be obtained using summary function
summary(emp.data)
##      emp_id    emp_name             salary        start_date        
##  Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01  
##  1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23  
##  Median :3   Mode  :character   Median :623.3   Median :2014-05-11  
##  Mean   :3                      Mean   :664.4   Mean   :2014-01-14  
##  3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15  
##  Max.   :5                      Max.   :843.2   Max.   :2015-03-27
# Extract specific column from the data frame

# Extracting the First column
result<-emp.data[,1]
result # data type is vector
## [1] 1 2 3 4 5
# Extracting the First two columns
result<-emp.data[,c(1,2)]
result
##   emp_id emp_name
## 1      1     Rick
## 2      2      Dan
## 3      3 Michelle
## 4      4     Ryan
## 5      5     Gary
result <- emp.data[,c("emp_id","emp_name")]
result
##   emp_id emp_name
## 1      1     Rick
## 2      2      Dan
## 3      3 Michelle
## 4      4     Ryan
## 5      5     Gary
# Getting the first row data
result<-emp.data[1,]
result
##   emp_id emp_name salary start_date
## 1      1     Rick  623.3 2012-01-01
class(result) # A single row is still a data frame, not a vector
## [1] "data.frame"
# Getting the first two rows data
result<-emp.data[c(1,2),]
result
##   emp_id emp_name salary start_date
## 1      1     Rick  623.3 2012-01-01
## 2      2      Dan  515.2 2013-09-23
# Getting the first row data for column 1
result<-emp.data[1,1]
result
## [1] 1
# Getting the first row data for column 1 and column 2
result<-emp.data[1,c(1,2)]
result
##   emp_id emp_name
## 1      1     Rick
# Getting the first and second row data for column 1 and column 2
result<-emp.data[c(1,2),c(1,2)]
result
##   emp_id emp_name
## 1      1     Rick
## 2      2      Dan
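
Besides numeric subscripts, columns can be pulled out by name with $ or [[ ]], and rows can be filtered with a logical condition. The sketch below uses a reduced copy of the employee data (named emp here only for illustration) so that it runs on its own:

```r
# Reduced copy of the employee data from above
emp <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  stringsAsFactors = FALSE
)

emp$salary                        # $ extracts one column as a vector
emp[["emp_name"]]                 # same idea with [[ ]]
high <- emp[emp$salary > 700, ]   # keep rows where the condition is TRUE
high$emp_name                     # "Ryan" "Gary"
class(emp[, 1, drop = FALSE])     # drop = FALSE keeps a one-column data frame
```
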
# Expanding the data frame by adding a column
emp.data$dept<-c("IT","Operations","IT","HR","Finance")
colnames(emp.data)
## [1] "emp_id"     "emp_name"   "salary"     "start_date" "dept"
head(emp.data)
##   emp_id emp_name salary start_date       dept
## 1      1     Rick 623.30 2012-01-01         IT
## 2      2      Dan 515.20 2013-09-23 Operations
## 3      3 Michelle 611.00 2014-11-15         IT
## 4      4     Ryan 729.00 2014-05-11         HR
## 5      5     Gary 843.25 2015-03-27    Finance
# Adding rows using the rbind function
# Create the second data frame
emp.newdata <-  data.frame(
  emp_id = 6:8, 
  emp_name = c("Rasmi","Pranab","Tusar"),
  salary = c(578.0,722.5,632.8), 
  start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
  dept = c("IT","Operations","Finance"),
  stringsAsFactors = FALSE
)

emp.newdata
##   emp_id emp_name salary start_date       dept
## 1      6    Rasmi  578.0 2013-05-21         IT
## 2      7   Pranab  722.5 2013-07-30 Operations
## 3      8    Tusar  632.8 2014-06-17    Finance
dim(emp.newdata) 
## [1] 3 5
colnames(emp.newdata) 
## [1] "emp_id"     "emp_name"   "salary"     "start_date" "dept"
# Bind the two data frames.
emp.finaldata <- rbind.data.frame(emp.data,emp.newdata)
print(emp.finaldata)
##   emp_id emp_name salary start_date       dept
## 1      1     Rick 623.30 2012-01-01         IT
## 2      2      Dan 515.20 2013-09-23 Operations
## 3      3 Michelle 611.00 2014-11-15         IT
## 4      4     Ryan 729.00 2014-05-11         HR
## 5      5     Gary 843.25 2015-03-27    Finance
## 6      6    Rasmi 578.00 2013-05-21         IT
## 7      7   Pranab 722.50 2013-07-30 Operations
## 8      8    Tusar 632.80 2014-06-17    Finance
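
Row binding works here because both data frames have the same column names. As a rough sketch with made-up data frames a, b and bad, rbind() matches columns by name rather than by position, and raises an error when the names differ:

```r
# Hypothetical data frames, used only for illustration
a <- data.frame(id = 1:2, dept = c("IT", "HR"), stringsAsFactors = FALSE)
b <- data.frame(dept = "Finance", id = 3L, stringsAsFactors = FALSE)  # columns in a different order

ab <- rbind(a, b)   # columns are matched by name, not position
ab$id               # 1 2 3
ab$dept             # "IT" "HR" "Finance"

# A data frame with a different column name cannot be bound
bad <- data.frame(id = 4L, team = "Ops", stringsAsFactors = FALSE)
res <- tryCatch(rbind(a, bad), error = function(e) "error")
res                 # "error"
```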


