Web Scraping Tutorial 2: Getting the Overall Rating and Number of Reviews
2024-04-23
Introduction
In the first tutorial, we looked at how to use RSelenium to extract content from the web. Specifically, we looked at how to use the XPath of a web element (such as the store name) to scrape information from Google reviews. We used the following functions to extract data:
- web_driver$navigate(l1)
- web_driver$findElements
- getElementAttribute("href")
- web_driver$findElements(using = "xpath", value = nm)[[1]]$getElementText()
In this post, we will see how to extract the average Google rating and the total number of reviews for each store from the previous examples.
Step 0: How RSelenium scrapes the data for these two stores
We will use the following steps to get the information:
- Start a headless browser
- Navigate to the Google Maps page (shown above)
- Get the URL (link) of each of these stores
- Navigate to each of these links
- Get the XPath for the store name and address
- For each XPath (name and address), get the element located there
As the first step, we will start a headless browser. Firefox works fine on my system, so I will use it. Before that, let's load the required libraries.
# Install any package that is missing, then attach it
package.name <- c("tidyverse", "RSelenium")
for (i in package.name) {
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: RSelenium
Step 1: Start a headless Firefox browser
The syntax for starting a headless Firefox browser is shown below:
driver <- rsDriver(
  browser = "firefox",
  chromever = NULL,
  verbose = FALSE,
  extraCapabilities = list("firefoxOptions" = list(args = list("--headless")))
)
# The client object is what we use to drive the browser
web_driver <- driver[["client"]]
Once this is executed, the Firefox browser starts in the background.
Step 3: Getting the URL for each store
We can see that there are just two stores here. We will get the store name and the corresponding address in a data frame.
For this, we follow a two-step process:
- Get the URL for each store
- Once you have the URL, open it and get the name and address
Get the XPath of the URL link by inspecting the element in the browser. The XPaths for the two stores look like this:
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[4]/div/a
- /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[6]/div/a
The only difference between the two is the penultimate div element: it is div[4] for the first store and div[6] for the second.
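The comparison above can also be done in code. The following is a minimal sketch, not part of the original workflow: generalise_xpath is a hypothetical helper that splits both XPaths into steps and drops the positional index wherever the two disagree. It assumes both paths have the same number of steps and the same tag names.

```r
# Two absolute XPaths copied from the browser inspector (taken from the text above)
xp1 <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[4]/div/a"
xp2 <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[6]/div/a"

# Hypothetical helper: remove the [n] index from every step where the two
# paths differ, so that the resulting path can match both elements.
# Assumes equal length and identical tag names at each step.
generalise_xpath <- function(a, b) {
  steps_a <- strsplit(a, "/", fixed = TRUE)[[1]]
  steps_b <- strsplit(b, "/", fixed = TRUE)[[1]]
  stopifnot(length(steps_a) == length(steps_b))
  differs <- steps_a != steps_b
  steps_a[differs] <- sub("\\[[0-9]+\\]$", "", steps_a[differs])
  paste(steps_a, collapse = "/")
}

common_xpath <- generalise_xpath(xp1, xp2)
common_xpath
```

The result is a single path in which div[4] and div[6] have both become a plain div, which is the form passed to findElements when collecting the store links.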
Now we will use this information to extract all the links. For each of these XPaths, we need to get the href (URL).
The penultimate div, which is the only difference between the store 1 and store 2 XPaths, is written as div (instead of div[4] or div[6]) so that the path matches both stores. For each matched element, we then extract the href using getElementAttribute("href"):
# Find every store link that matches the generalised XPath
link_Store <- web_driver$findElements(using = "xpath", value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div/div/a")
# Collect the href (URL) of each link in a list
l1 <- list()
for (i in seq_along(link_Store)) {
  l1[[i]] <- link_Store[[i]]$getElementAttribute("href")[[1]]
}
l1
[[1]]
[1] "https://www.google.co.id/maps/place/JUMBO+KING+MULUND/data=!4m7!3m6!1s0x3be7b9f243892907:0xc1c58fde55a52ab8!8m2!3d19.1716851!4d72.9552551!16s%2Fg%2F11c6lk8nbl!19sChIJBymJQ_K55zsRuCqlVd6PxcE?authuser=0&hl=en&rclk=1"
[[2]]
[1] "https://www.google.co.id/maps/place/Jumbo+Vada+Pav/data=!4m7!3m6!1s0x3be7b9ee2763f83f:0x6a56910364c6346b!8m2!3d19.1722852!4d72.9559067!16s%2Fg%2F11t10_mhq2!19sChIJP_hjJ-655zsRazTGZAORVmo?authuser=0&hl=en&rclk=1"
We can see that the URLs for the two stores are now stored in the list l1. We will use these links to navigate to each store page and then extract the store name and address.
Step 4: Getting the store name and address
Now we will navigate to each of these store links, get the XPath for the store name and address, and extract the corresponding elements.
Step 4a: Getting store name and address for store 1
web_driver$navigate(l1[[1]])
# The XPath where the store name is located is the same for both stores
nm1_name <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"
# Getting the store name
store_nm1 <- web_driver$findElements(using = "xpath", value = nm1_name)[[1]]$getElementText()[[1]]
# The XPath where the store address is located
nm1_add <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[9]/div[3]/button/div/div[2]/div[1]"
# Getting the store address
store_add1 <- web_driver$findElements(using = "xpath", value = nm1_add)[[1]]$getElementText()[[1]]
store1.df <- data.frame(Store_Name = store_nm1,
                        Store_Address = store_add1)
store1.df
Store_Name
1 JUMBO KING MULUND
Store_Address
1 Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
Step 4b: Getting store name and address for store 2
web_driver$navigate(l1[[2]])
# The XPath where the store name is located is the same for both stores
nm2_name <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"
# Getting the store name
store_nm2 <- web_driver$findElements(using = "xpath", value = nm2_name)[[1]]$getElementText()[[1]]
# The XPath where the store address is located (div[7] here, div[9] for store 1)
nm2_add <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[7]/div[3]/button/div/div[2]/div[1]"
# Getting the store address
store_add2 <- web_driver$findElements(using = "xpath", value = nm2_add)[[1]]$getElementText()[[1]]
store2.df <- data.frame(Store_Name = store_nm2,
                        Store_Address = store_add2)
store2.df
Store_Name
1 Jumbo Vada Pav
Store_Address
1 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
Let's combine the two data frames.
interim.df<-rbind.data.frame(store1.df,store2.df)
interim.df
Store_Name
1 JUMBO KING MULUND
2 Jumbo Vada Pav
Store_Address
1 Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
2 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
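Both steps above repeat the same pattern: find the elements for an XPath, take the first one, and read its text. If the XPath matches nothing, the [[1]] fails with a "subscript out of bounds" error. The sketch below wraps the pattern in a hypothetical helper, get_text_or_na, that returns NA instead. It assumes findElements returns an empty list when nothing matches. So that the example runs without a browser, fake_driver is a made-up stand-in for web_driver.

```r
# Hypothetical helper: return the text of the first element matching `xpath`,
# or NA when nothing matches (instead of a "subscript out of bounds" error)
get_text_or_na <- function(driver, xpath) {
  elems <- driver$findElements(using = "xpath", value = xpath)
  if (length(elems) == 0) {
    return(NA_character_)
  }
  elems[[1]]$getElementText()[[1]]
}

# Made-up stand-in for a browser session so the sketch runs without Selenium:
# it knows a single XPath and returns no elements for anything else
fake_driver <- list(
  findElements = function(using, value) {
    if (identical(value, "//h1")) {
      list(list(getElementText = function() list("JUMBO KING MULUND")))
    } else {
      list()
    }
  }
)

get_text_or_na(fake_driver, "//h1")       # text of the matched element
get_text_or_na(fake_driver, "//missing")  # NA, no error
```

With a live session, the same helper would be called as get_text_or_na(web_driver, nm1_name).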
Step 5: Getting the logic for the overall average rating
For this, we first use web_driver$navigate(l1[[1]]) to go to the URL for store 1.
Now we highlight the rating (3.6), inspect the element, and get the corresponding XPath.
The XPath for the rating is: /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[1]/span[1]
Similarly, we can get the XPath for the total number of reviews.
The XPath is: /html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[2]/span/span
Now let's use these two XPaths to get the average rating and the total number of reviews for store 1.
web_driver$navigate(l1[[1]])
# The XPath where the store rating is located
nm1_rating <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[1]/span[1]"
# Getting the store rating
store_rating1 <- web_driver$findElements(using = "xpath", value = nm1_rating)[[1]]$getElementText()[[1]]
# The XPath where the store review count is located
nm1_count_review <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[2]/span/span"
# Getting the store review count
store_review_count1 <- web_driver$findElements(using = "xpath", value = nm1_count_review)[[1]]$getElementText()[[1]]
store1.rating.df <- data.frame(Avg_Rating = store_rating1,
                               Total_Review_Count = store_review_count1)
store1.rating.df
Avg_Rating Total_Review_Count
1 3.6 (150)
Store 2
web_driver$navigate(l1[[2]])
# The XPath where the store rating is located
nm2_rating <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[1]/span[1]"
# Getting the store rating
store_rating2 <- web_driver$findElements(using = "xpath", value = nm2_rating)[[1]]$getElementText()[[1]]
# The XPath where the store review count is located
nm2_count_review <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[2]/span/span"
# Getting the store review count
store_review_count2 <- web_driver$findElements(using = "xpath", value = nm2_count_review)[[1]]$getElementText()[[1]]
store2.rating.df <- data.frame(Avg_Rating = store_rating2,
                               Total_Review_Count = store_review_count2)
store2.rating.df
Avg_Rating Total_Review_Count
1 4.8 (8)
interim.df2<-rbind.data.frame(store1.rating.df,store2.rating.df)
interim.df2
Avg_Rating Total_Review_Count
1 3.6 (150)
2 4.8 (8)
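Both columns in this data frame are text: the rating is the string "3.6" and the review count still carries its brackets. Here is a small sketch of how they could be converted to numbers for later analysis. The data frame is rebuilt by hand from the output above so the example runs on its own, and the name ratings is only illustrative.

```r
# Values copied from the output above; both columns are character strings
ratings <- data.frame(
  Avg_Rating = c("3.6", "4.8"),
  Total_Review_Count = c("(150)", "(8)"),
  stringsAsFactors = FALSE
)

# Convert the rating text to a number
ratings$Avg_Rating <- as.numeric(ratings$Avg_Rating)

# Keep only the digits of the review count, which drops the brackets
# (and would also drop a thousands separator, if one ever appears),
# then convert to integer
ratings$Total_Review_Count <- as.integer(gsub("[^0-9]", "", ratings$Total_Review_Count))

str(ratings)
```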
Having walked through the steps one at a time, let's see if we can do all of this in one go. For this, we need to do the following:
- Navigate to each store link
- Get the store name, address, rating, and review count using XPaths common to both stores
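One way to structure this is to write a function that handles a single store and apply it to every link, which avoids keeping a manual counter. The sketch below only shows the collection pattern: scrape_store is a hypothetical stand-in that returns a one-row data frame without touching a browser, and the links are placeholders rather than real store URLs.

```r
# Hypothetical stand-in for the per-store scraping step: it returns a one-row
# data frame for a URL. A real version would navigate to the page and read it.
scrape_store <- function(url) {
  data.frame(Store_URL = url, Visited = TRUE, stringsAsFactors = FALSE)
}

# Placeholder links, not real store URLs
store_links <- list("https://example.com/store-1", "https://example.com/store-2")

# One row per link, then bind all rows into a single data frame
rows <- lapply(store_links, scrape_store)
all_stores <- do.call(rbind, rows)
all_stores
```

Because lapply returns exactly one list entry per link, there is no index to keep in step with the loop.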
Step 6: Running all the logic at once
Navigating in the headless browser and getting the individual store URLs
# Search text for the stores (stray spaces removed from the text and the URL)
nm <- "Jumbo wada pav mulund west"
ad_url <- str_c("https://www.google.co.id/maps/search/", nm)
# Now navigate to the URL so that the browser opens the search results
web_driver$navigate(ad_url)
# Find every store link that matches the generalised XPath
link_Store <- web_driver$findElements(using = "xpath", value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div/div/a")
# Collect the href (URL) of each link in a list
l1 <- list()
for (i in seq_along(link_Store)) {
  l1[[i]] <- link_Store[[i]]$getElementAttribute("href")[[1]]
}
l1
[[1]]
[1] "https://www.google.co.id/maps/place/JUMBO+KING+MULUND/data=!4m7!3m6!1s0x3be7b9f243892907:0xc1c58fde55a52ab8!8m2!3d19.1716851!4d72.9552551!16s%2Fg%2F11c6lk8nbl!19sChIJBymJQ_K55zsRuCqlVd6PxcE?authuser=0&hl=en&rclk=1"
[[2]]
[1] "https://www.google.co.id/maps/place/Jumbo+Vada+Pav/data=!4m7!3m6!1s0x3be7b9ee2763f83f:0x6a56910364c6346b!8m2!3d19.1722852!4d72.9559067!16s%2Fg%2F11t10_mhq2!19sChIJP_hjJ-655zsRazTGZAORVmo?authuser=0&hl=en&rclk=1"
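The search URL above is built by pasting the search text onto the Google Maps search address. The text contains spaces, so a slightly more defensive version would trim and percent-encode it first. This is only a sketch: build_search_url is a hypothetical helper, and the browser may well cope with the unencoded version too.

```r
# Hypothetical helper: trim the search text and percent-encode it before
# appending it to the Google Maps search address used in this post
build_search_url <- function(query, base = "https://www.google.co.id/maps/search/") {
  paste0(base, utils::URLencode(trimws(query), reserved = TRUE))
}

build_search_url("Jumbo wada pav mulund west ")
```

The returned string can be passed to web_driver$navigate() in place of ad_url.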
For each store, we get the following:
- Name
- Address
- Avg Rating
- Review count
l2 <- list()
k <- 0
for (i in l1) {
  # Advance the counter; starting from 0 stores the first store at l2[[1]]
  k <- k + 1
  # Accessing the store URL
  web_driver$navigate(i)
  ############################# STORE NAME #############################
  # The XPath where the store name is located is the same for both stores
  store_name_xml <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[1]/h1"
  # Getting the store name
  store_name <- web_driver$findElements(using = "xpath", value = store_name_xml)[[1]]$getElementText()[[1]]
  ############################# STORE ADDRESS ##########################
  # The step before div[3] has no index because it differs between stores
  # (div[9] for store 1, div[7] for store 2). An un-indexed step can match
  # several elements and [[1]] keeps only the first, whose text may be empty.
  store_add_xml <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div/div[3]/button/div/div[2]/div[1]"
  # Getting the store address
  store_add <- web_driver$findElements(using = "xpath", value = store_add_xml)[[1]]$getElementText()[[1]]
  ############################# STORE RATING ###########################
  store_rating_xml <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[1]/span[1]"
  # Getting the store average rating
  store_rating <- web_driver$findElements(using = "xpath", value = store_rating_xml)[[1]]$getElementText()[[1]]
  ############################# STORE REVIEW COUNT #####################
  store_rating_count_xml <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div/div[1]/div[2]/div/div[1]/div[2]/span[2]/span/span"
  # Getting the store review count
  store_review_count <- web_driver$findElements(using = "xpath", value = store_rating_count_xml)[[1]]$getElementText()[[1]]
  # Data frame containing the details for this store
  store.df <- data.frame(Store_Name = store_name,
                         Store_Address = store_add,
                         Store_Rating = store_rating,
                         Store_Total_Review = store_review_count)
  l2[[k]] <- store.df
}
# Stack the per-store data frames into one
final.df <- do.call(rbind.data.frame, l2)
final.df
Store_Name
1 JUMBO KING MULUND
2 Jumbo Vada Pav
Store_Address
1
2 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
Store_Rating Store_Total_Review
1 3.6 (150)
2 4.8 (8)
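Note that the address for store 1 is blank in this output. One possible explanation, which I have not verified against the live page, is that the generalised address XPath matches more than one element on that page and the first match has no text. Under that assumption, a small sketch of a workaround is to read the text of every match and keep the first non-blank one. first_nonempty is a hypothetical helper and the example values are made up.

```r
# Hypothetical helper: given the texts of all elements matched by one XPath,
# return the first non-blank one, or NA if every match is blank
first_nonempty <- function(texts) {
  texts <- trimws(as.character(unlist(texts)))
  hits <- texts[!is.na(texts) & nzchar(texts)]
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Made-up example: the first match is blank, the second holds the address
first_nonempty(list("", "Shop no 1, Sardar Vallabhbhai Patel Rd"))
```

In the loop, the texts could be gathered with something like lapply(elems, function(e) e$getElementText()[[1]]), where elems is the list returned by findElements. Whether the first non-blank match really is the address depends on the page layout, so the result should be checked by eye.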