Different Data Structures in R
Parag Verma
4th Dec, 2019
Vectors: Initialization, Length and Indexing
Vectors are the most basic entities in R. They are the building blocks for storing information and can hold values of various types, including numbers, text and logical values. All elements of a single vector share one type. Let's look at how to create vectors and what the different types are.
# Let us define vectors p,q and r
p <- c(1,2,5.3,6,-2,4) # numeric vector
q <- c("one","two","three") # character vector
r <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
# The c() function combines the values listed inside the parentheses into one vector
# Let's check what is there in vector 'p'
p
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0
class(p) # class gives the nature of values stored in 'p' vector
## [1] "numeric"
# Let's check what is there in vector 'q'
q
## [1] "one" "two" "three"
class(q)
## [1] "character"
# Let's check what is there in vector 'r'
r
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
class(r)
## [1] "logical"
# Refer to elements of a vector using subscripts.
# Accessing the third element within p
p[3]
## [1] 5.3
# Accessing the third and fourth elements
p[c(3,4)]
## [1] 5.3 6.0
# Let us calculate the length of the vector p
length(p)
## [1] 6
# On accessing the 7th element within p we get an NA, as there
# is no element at the 7th position in p
p[7]
## [1] NA
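Besides single positions, a vector can also be indexed with negative numbers or with a logical condition, and mixing types inside c() coerces every value to one common type. The following is a minimal sketch that reuses the values of p from above; the names p_without_first, p_positive and mixed are illustrative and not part of the original.

```r
# Same numeric vector as above
p <- c(1, 2, 5.3, 6, -2, 4)

# A negative index drops the element at that position
p_without_first <- p[-1]

# A logical condition keeps only the elements for which it is TRUE
p_positive <- p[p > 0]

# Mixing numbers, text and logical values coerces everything to character
mixed <- c(1, "two", TRUE)

print(p_without_first)
print(p_positive)
print(class(mixed))
```

Because a vector holds a single type, the numeric 1 and the logical TRUE in mixed are stored as the strings "1" and "TRUE".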
Matrices: Initialization, Dimension and Indexing
All data elements in a matrix must be of the same type, and all columns must have the same length.
Syntax for creating a matrix
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows.
byrow=FALSE indicates that the matrix should be filled by columns (the default)
dimnames provides optional labels for the rows and columns, in that order.
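The effect of byrow is easiest to see by filling the same values both ways. This is a small sketch; the names by_col and by_row are chosen only for illustration.

```r
# Fill the same six values column-wise (the default) and row-wise
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)  # the first column holds 1 and 2
print(by_row)  # the first row holds 1, 2 and 3
```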
# generates 5 x 4 numeric matrix
1:20 # Creates a vector sequence from 1 through 20 with a step of 1
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y<-matrix(1:20, nrow=5,ncol=4)
y
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
# Accessing element in a matrix
y[1,2]
## [1] 6
y[,] # Outputs the entire matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
                   dimnames=list(rnames, cnames))
mymatrix
## C1 C2
## R1 1 26
## R2 24 68
# mymatrix[2,3]
# This gives a "subscript out of bounds" error,
# because the matrix has only two columns and no element exists at row 2, column 3
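If a script must keep running when an index might be out of range, the error can be caught with tryCatch(), or avoided by checking the dimensions first. This is a hedged sketch rather than a required pattern; msg and col_exists are illustrative names.

```r
mymatrix <- matrix(c(1, 26, 24, 68), nrow = 2, ncol = 2, byrow = TRUE,
                   dimnames = list(c("R1", "R2"), c("C1", "C2")))

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  mymatrix[2, 3],
  error = function(e) conditionMessage(e)
)
print(msg)

# Checking the number of columns first avoids the error altogether
col_exists <- 3 <= ncol(mymatrix)
print(col_exists)
```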
# Identify rows, columns or elements using subscripts.
y[,4] # 4th column of matrix
## [1] 16 17 18 19 20
y[3,] # 3rd row of matrix
## [1] 3 8 13 18
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
## [,1] [,2] [,3]
## [1,] 2 7 12
## [2,] 3 8 13
## [3,] 4 9 14
class(y[3,]) # It results in a vector
## [1] "integer"
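Selecting a single row or column drops the matrix dimensions, which is why the result above is a plain vector. Passing drop = FALSE keeps the result as a matrix. A short sketch, with illustrative variable names:

```r
y <- matrix(1:20, nrow = 5, ncol = 4)

row_as_vector <- y[3, ]                # dimensions are dropped
row_as_matrix <- y[3, , drop = FALSE]  # stays a 1 x 4 matrix

print(dim(row_as_vector))  # NULL, because a vector has no dimensions
print(dim(row_as_matrix))
```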
Lists: Initialization, Length and Indexing
Lists are R objects that can hold elements of different data types, such as numbers, strings, vectors and even other lists. A list can also contain a matrix or a function as an element. Lists matter in practice because many data-heavy steps in R rely on them. A list is created using the list() function.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
                  list("green",12.3))
list_data
## [[1]]
## [1] "Jan" "Feb" "Mar"
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
##
## [[3]]
## [[3]][[1]]
## [1] "green"
##
## [[3]][[2]]
## [1] 12.3
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
names(list_data)
## [1] "1st Quarter" "A_Matrix" "A Inner list"
# Accessing the elements in the list
# Elements of a list can be accessed by their index in the list
list_data[1] # Single brackets return a list that contains the first element
## $`1st Quarter`
## [1] "Jan" "Feb" "Mar"
list_data[[1]][1] # Accessing the first value within the first element
## [1] "Jan"
list_data[[1]][2]
## [1] "Feb"
list_data[[1]][3]
## [1] "Mar"
# Difference between [] and [[]] referencing
list_data[2] # Gives a list
## $A_Matrix
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
list_data[[2]] # Gives a matrix
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
list_data[3] # Gives a list
## $`A Inner list`
## $`A Inner list`[[1]]
## [1] "green"
##
## $`A Inner list`[[2]]
## [1] 12.3
list_data[[3]][1] # Gives a one-element list taken from the inner list
## [[1]]
## [1] "green"
list_data[[3]][[1]] # Gives a vector
## [1] "green"
list_data[[3]][[2]] # Gives a vector
## [1] 12.3
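Since the list elements were given names earlier, they can also be accessed by name. The sketch below rebuilds the same list so that it runs on its own; m, months and inner are illustrative names.

```r
list_data <- list(c("Jan", "Feb", "Mar"),
                  matrix(c(3, 9, 5, 1, -2, 8), nrow = 2),
                  list("green", 12.3))
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# $ works directly for syntactically valid names
m <- list_data$A_Matrix

# Names with spaces or a leading digit need [[ ]] or backticks after $
months <- list_data[["1st Quarter"]]
inner  <- list_data$`A Inner list`

print(is.matrix(m))
print(months)
print(length(inner))
```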
# Manipulating the Elements in a List
# We can add, delete and update list elements as shown below.
# Assigning to the next free index adds an element at the end,
# assigning NULL deletes an element, and any existing element can be updated.
# Add element at the end of the list.
list_data[4] <- "New element"
print(list_data[4])
## [[1]]
## [1] "New element"
# Remove the last element.
list_data[4] <- NULL
# Print the 4th element: it no longer exists, so R shows NULL under an <NA> name.
print(list_data[4])
## $<NA>
## NULL
# Update the 3rd Element.
list_data[3] <- "updated element"
print(list_data[3])
## $`A Inner list`
## [1] "updated element"
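Elements can also be inserted or removed away from the end of a list. A minimal sketch using base R's append() with its after argument and NULL assignment; the list colours is made up for illustration.

```r
colours <- list("red", "green", "blue")

# Insert a new element after position 1
colours <- append(colours, list("yellow"), after = 1)

# Remove the third element (now "green") by assigning NULL to it
colours[3] <- NULL

print(unlist(colours))
```

After both steps the list holds "red", "yellow" and "blue", in that order.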
# Merging of the List
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
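# Merge the two lists with c(): the result is one flat list of six elements.
merged.list <- c(list1,list2)
# By contrast, list(list1,list2) does not merge. It nests the two lists
# inside a new list of length two.
nested.list <- list(list1,list2)
# Print the nested list to see the two levels.
print(nested.list)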
## [[1]]
## [[1]][[1]]
## [1] 1
##
## [[1]][[2]]
## [1] 2
##
## [[1]][[3]]
## [1] 3
##
##
## [[2]]
## [[2]][[1]]
## [1] "Sun"
##
## [[2]][[2]]
## [1] "Mon"
##
## [[2]][[3]]
## [1] "Tue"
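The difference between combining lists with c() and with list() is easy to check with length(). A short sketch; flat and nested are illustrative names.

```r
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

flat   <- c(list1, list2)     # one list with six elements
nested <- list(list1, list2)  # a list that holds two lists

print(length(flat))
print(length(nested))
print(flat[[4]])
```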
# Converting a List into a vector using unlist function
# Create lists.
list1 <- list(1:5)
print(list1) # Prints the list
## [[1]]
## [1] 1 2 3 4 5
list1[[1]] # Outputs the vector
## [1] 1 2 3 4 5
# Convert the lists to vectors.
v1 <- unlist(list1)
print(v1)
## [1] 1 2 3 4 5
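When the list holds values of different types, unlist() still returns a single vector, so the values are coerced to a common type. A small sketch with a made-up list, mixed_list:

```r
mixed_list <- list(a = 1, b = "two", c = TRUE)

# unlist() flattens the list; here every value becomes a character string
v <- unlist(mixed_list)

print(v)
print(class(v))
```

The element names a, b and c are carried over as the names of the resulting vector.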
Data Frame: Initialization, Dimensions and Indexing
A data frame is similar to a table with rows and columns. Each row contains one value from each column.
The Key Features of a data frame are:
1. The column names should be non-empty.
2. The row names should be unique.
3. The data stored in a data frame can be of numeric, factor or character type, as well as other types such as Date or logical.
4. Each column should contain the same number of data items.
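The last rule can be checked directly: data.frame() raises an error when the column lengths cannot be reconciled. A hedged sketch with made-up columns; ok and bad are illustrative names.

```r
# Two columns of equal length work as expected
ok <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Columns of length 3 and 2 cannot be combined, so data.frame() raises an error.
# (Shorter columns are only recycled when their length divides the longest one.)
bad <- tryCatch(
  data.frame(id = 1:3, name = c("a", "b")),
  error = function(e) "error"
)

print(dim(ok))
print(bad)
```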
# Creating a Data Frame
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                         "2015-03-27")),
  stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
## emp_id emp_name salary start_date
## 1 1 Rick 623.30 2012-01-01
## 2 2 Dan 515.20 2013-09-23
## 3 3 Michelle 611.00 2014-11-15
## 4 4 Ryan 729.00 2014-05-11
## 5 5 Gary 843.25 2015-03-27
class(emp.data)
## [1] "data.frame"
# Getting the number of rows and columns: dimensions of a data frame
dim(emp.data)
## [1] 5 4
# Getting the names of the columns
colnames(emp.data)
## [1] "emp_id" "emp_name" "salary" "start_date"
# Getting the first few records
head(emp.data)
## emp_id emp_name salary start_date
## 1 1 Rick 623.30 2012-01-01
## 2 2 Dan 515.20 2013-09-23
## 3 3 Michelle 611.00 2014-11-15
## 4 4 Ryan 729.00 2014-05-11
## 5 5 Gary 843.25 2015-03-27
head(emp.data,3)
## emp_id emp_name salary start_date
## 1 1 Rick 623.3 2012-01-01
## 2 2 Dan 515.2 2013-09-23
## 3 3 Michelle 611.0 2014-11-15
# Getting the last few records
tail(emp.data)
## emp_id emp_name salary start_date
## 1 1 Rick 623.30 2012-01-01
## 2 2 Dan 515.20 2013-09-23
## 3 3 Michelle 611.00 2014-11-15
## 4 4 Ryan 729.00 2014-05-11
## 5 5 Gary 843.25 2015-03-27
tail(emp.data,3)
## emp_id emp_name salary start_date
## 3 3 Michelle 611.00 2014-11-15
## 4 4 Ryan 729.00 2014-05-11
## 5 5 Gary 843.25 2015-03-27
# Get the Structure of the Data frame
# str is used to get the data types and first few values of the columns used
str(emp.data)
## 'data.frame': 5 obs. of 4 variables:
## $ emp_id : int 1 2 3 4 5
## $ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
## $ salary : num 623 515 611 729 843
## $ start_date: Date, format: "2012-01-01" "2013-09-23" ...
class(dim(emp.data))
## [1] "integer"
k<-dim(emp.data)
k[1] # Get the number of rows
## [1] 5
k[2] # Get the number of columns
## [1] 4
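The same counts are available directly through nrow() and ncol(), without indexing into the result of dim(). A minimal sketch on a shortened, illustrative copy of the employee data called emp:

```r
emp <- data.frame(emp_id = 1:5,
                  salary = c(623.3, 515.2, 611.0, 729.0, 843.25))

n_rows <- nrow(emp)  # same value as dim(emp)[1]
n_cols <- ncol(emp)  # same value as dim(emp)[2]

print(n_rows)
print(n_cols)
```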
# Statistical Summary can be obtained using summary function
summary(emp.data)
##      emp_id    emp_name             salary        start_date
##  Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
##  1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
##  Median :3   Mode  :character   Median :623.3   Median :2014-05-11
##  Mean   :3                      Mean   :664.4   Mean   :2014-01-14
##  3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
##  Max.   :5                      Max.   :843.2   Max.   :2015-03-27
# Extract specific column from the data frame
# Extracting the First column
result<-emp.data[,1]
result # data type is vector
## [1] 1 2 3 4 5
# Extracting the First two columns
result<-emp.data[,c(1,2)]
result
## emp_id emp_name
## 1 1 Rick
## 2 2 Dan
## 3 3 Michelle
## 4 4 Ryan
## 5 5 Gary
result <- emp.data[,c("emp_id","emp_name")]
result
## emp_id emp_name
## 1 1 Rick
## 2 2 Dan
## 3 3 Michelle
## 4 4 Ryan
## 5 5 Gary
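A single column can also be pulled out by name with $ or [[ ]], which return a vector, or kept as a one-column data frame with drop = FALSE. The sketch below uses a shortened, illustrative copy of the data called emp.

```r
emp <- data.frame(
  emp_id   = 1:3,
  emp_name = c("Rick", "Dan", "Michelle"),
  stringsAsFactors = FALSE
)

by_dollar  <- emp$emp_name                     # character vector
by_bracket <- emp[["emp_name"]]                # the same vector
one_col_df <- emp[, "emp_name", drop = FALSE]  # still a data frame

print(by_dollar)
print(class(one_col_df))
```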
# Getting the first row data
result<-emp.data[1,]
result
## emp_id emp_name salary start_date
## 1 1 Rick 623.3 2012-01-01
class(result) # A single row is still a data frame, not a vector
## [1] "data.frame"
# Getting the first two rows data
result<-emp.data[c(1,2),]
result
## emp_id emp_name salary start_date
## 1 1 Rick 623.3 2012-01-01
## 2 2 Dan 515.2 2013-09-23
# Getting the first row data for column 1
result<-emp.data[1,1]
result
## [1] 1
# Getting the first row data for column 1 and column 2
result<-emp.data[1,c(1,2)]
result
## emp_id emp_name
## 1 1 Rick
# Getting the first and second row data for column 1 and column 2
result<-emp.data[c(1,2),c(1,2)]
result
## emp_id emp_name
## 1 1 Rick
## 2 2 Dan
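Rows can be selected by a condition as well as by position, by placing a logical vector in the row slot. A hedged sketch on an illustrative copy of the data called emp; the threshold of 700 is arbitrary.

```r
emp <- data.frame(
  emp_id   = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary   = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

# Keep only the rows where salary exceeds 700
high_paid <- emp[emp$salary > 700, ]
print(high_paid)

# Combine a row condition with a column selection
names_only <- emp[emp$salary > 700, "emp_name"]
print(names_only)
```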
# Expanding the Data Frame by adding the columns
emp.data$dept<-c("IT","Operations","IT","HR","Finance")
colnames(emp.data)
## [1] "emp_id" "emp_name" "salary" "start_date" "dept"
head(emp.data)
## emp_id emp_name salary start_date dept
## 1 1 Rick 623.30 2012-01-01 IT
## 2 2 Dan 515.20 2013-09-23 Operations
## 3 3 Michelle 611.00 2014-11-15 IT
## 4 4 Ryan 729.00 2014-05-11 HR
## 5 5 Gary 843.25 2015-03-27 Finance
# Adding rows using the rbind function
# Create the second data frame
emp.newdata <- data.frame(
  emp_id = c(6:8),
  emp_name = c("Rasmi","Pranab","Tusar"),
  salary = c(578.0,722.5,632.8),
  start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
  dept = c("IT","Operations","Finance"),
  stringsAsFactors = FALSE
)
emp.newdata
## emp_id emp_name salary start_date dept
## 1 6 Rasmi 578.0 2013-05-21 IT
## 2 7 Pranab 722.5 2013-07-30 Operations
## 3 8 Tusar 632.8 2014-06-17 Finance
dim(emp.newdata)
## [1] 3 5
colnames(emp.newdata)
## [1] "emp_id" "emp_name" "salary" "start_date" "dept"
# Bind the two data frames; rbind() matches the columns by name.
emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)
## emp_id emp_name salary start_date dept
## 1 1 Rick 623.30 2012-01-01 IT
## 2 2 Dan 515.20 2013-09-23 Operations
## 3 3 Michelle 611.00 2014-11-15 IT
## 4 4 Ryan 729.00 2014-05-11 HR
## 5 5 Gary 843.25 2015-03-27 Finance
## 6 6 Rasmi 578.00 2013-05-21 IT
## 7 7 Pranab 722.50 2013-07-30 Operations
## 8 8 Tusar 632.80 2014-06-17 Finance
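Row binding only works when both data frames have the same column names. The sketch below uses small made-up data frames (a, b and c_bad are illustrative names) to show a successful bind and a failing one.

```r
a     <- data.frame(id = 1:2, dept = c("IT", "HR"), stringsAsFactors = FALSE)
b     <- data.frame(id = 3L, dept = "Finance", stringsAsFactors = FALSE)
c_bad <- data.frame(id = 4L, department = "IT", stringsAsFactors = FALSE)

# Matching column names: the rows are stacked
combined <- rbind(a, b)
print(nrow(combined))

# Mismatched column names: rbind() raises an error, caught here
result <- tryCatch(
  rbind(a, c_bad),
  error = function(e) "names do not match"
)
print(result)
```

Checking colnames() on both data frames before binding, as done above for emp.newdata, is a simple way to catch such mismatches early.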