Machine Learning Made Easy
Saturday, December 28, 2024
Saturday, October 26, 2024
Web Scraping Tutorial 4 - Getting the Busy Information from the Popular Times Section of Google Maps
Popular Times
In this blog we will scrape the "busy" text from the popular times section of Google Maps.
Step 0: Importing the libraries
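The later steps call rsDriver(), which comes from RSelenium, and str_c(), which comes from stringr. A minimal sketch of the imports, assuming those two packages are all that is needed, could look like this. The variable names required_pkgs, is_installed and missing_pkgs are illustrative only.

```r
# Assumption: RSelenium and stringr cover every function used in this post
required_pkgs <- c("RSelenium", "stringr")

# Check which packages are available before attaching them
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install these packages first: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(RSelenium)  # browser automation: rsDriver() and the remote driver client
  library(stringr)    # string helpers: str_c()
}
```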
Step 1: Start a headless Firefox browser
# Start a Selenium server and a headless Firefox session
driver <- rsDriver(
  browser = c("firefox"),
  chromever = NULL,
  verbose = FALSE,
  extraCapabilities = list("firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]
# Search Google Maps for Cedele restaurants
nm <- "cedele restaurant"
ad_url <- str_c("https://www.google.co.id/maps/search/", URLencode(nm))
web_driver$navigate(ad_url)
The page looks like the image below.
Step 2: Get the URL (link) of one of these restaurants to start with
To get the link, right-click on the first store and click Inspect.
The link to the restaurant is in an <a> tag.
Right-click on this element and copy its XPath.
# XPath of the link
nm1<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[3]/div/a"
nm1
## [1] "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[3]/div/a"
Using the XPath to access the link
link_restuarants <- web_driver$findElements(using = "xpath", value = nm1)
rest_url<-link_restuarants[[1]]$getElementAttribute("href")[[1]]
rest_url
## [1] "https://www.google.co.id/maps/place/Cedele+Bakery+Kitchen+-+The+Woodleigh+Mall/data=!4m7!3m6!1s0x31da1793c89df043:0xf72df23d7aafbfac!8m2!3d1.3379161!4d103.8723492!16s%2Fg%2F11v05s7v9f!19sChIJQ_CdyJMX2jERrL-vej3yLfc?authuser=0&hl=en&rclk=1"
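findElements returns an empty list when nothing matches, so indexing with [[1]] fails if the results have not finished rendering. One way to guard against that is to retry the lookup a few times. The sketch below is not part of the original code: find_with_retry and its arguments are hypothetical names, and it only assumes a driver object with a findElements(using, value) method like the one used above.

```r
# Hypothetical helper: retry an XPath lookup until at least one element is found
find_with_retry <- function(driver, xpath, attempts = 5, pause = 1) {
  for (k in seq_len(attempts)) {
    found <- driver$findElements(using = "xpath", value = xpath)
    if (length(found) > 0) {
      return(found)
    }
    Sys.sleep(pause)  # give the page time to render before the next try
  }
  list()  # nothing found after all attempts
}
```

With a live session this would be called as find_with_retry(web_driver, nm1), and the result should be checked with length() before using [[1]].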
Navigating to the URL
web_driver$navigate(rest_url)
Step 3: Scrolling down to the popular times section (MOST IMPORTANT)
Scrolling in Google Maps, reviews and popular times works very differently from most other websites. On most websites, a single scroll-down command scrolls the whole page. Google Maps, however, is effectively split into two panes, and you have to scroll up or down within the left pane. This is shown below.
So we need a more creative solution:
- Find the CSS selector of the scrollable element in the left section
- Use the Page Down or Page Up key to scroll appropriately.
3A: Getting the CSS selector of the scroll bar
The CSS selector comes out to be “div.bJzME:nth-child(2) > div:nth-child(1) > div:nth-child(1)”. Copy the CSS selector and not the CSS path.
Once we have the CSS selector, we can use findElement (and not findElements) to get the element that receives the scroll keys.
The same element can be used to scroll up as well as down. Sometimes we need to scroll both up and down within the same data extraction step, as we will see in the Monday busy time extraction.
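The scroll-up and scroll-down idea can be wrapped in a small helper. This is a sketch rather than the original code: scroll_panel and its arguments are hypothetical names, and it only assumes that the element exposes RSelenium's sendKeysToElement method and accepts the key names "page_down" and "page_up".

```r
# Hypothetical helper: press Page Down or Page Up on a scrollable element
scroll_panel <- function(element, direction = c("down", "up"), times = 1, pause = 1) {
  direction <- match.arg(direction)
  key_name <- if (direction == "down") "page_down" else "page_up"
  for (i in seq_len(times)) {
    element$sendKeysToElement(sendKeys = list(key = key_name))
    Sys.sleep(pause)  # let the panel finish scrolling before the next key press
  }
  invisible(times)
}
```

With a live session, scroll_panel(scrollable_div, "down") would replace a one-iteration for loop, and scroll_panel(scrollable_div, "up") would scroll back.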
Step 4: Extracting information from the popular times section for Monday
If we scroll down, we see a histogram-like chart as shown below.
We have to extract the height of the bars for different days. For this we use the drop-down and select the required day. For our example, let's say we want to check how busy the place is on Monday.
Step 1 here is to open the drop-down menu so the different days become visible, and then click on Monday to get the occupancy details. We can right-click on the downward triangle of the drop-down to get its XPath as shown below.
Step 2 is to get the XPath for the Monday entry as shown below.
Scrolling down to reach the popular times section
# CSS selector for the scrollable left panel
scrl_nm <- "div.bJzME:nth-child(2) > div:nth-child(1) > div:nth-child(1)"
scrollable_div <- try(web_driver$findElement(using = "css", value = scrl_nm))
# One Page Down takes us to the popular times section, where the drop-down triangle is just visible
for (i in 1) {
  scrollable_div$sendKeysToElement(sendKeys = list(key = "page_down"))
  Sys.sleep(1)
}
# Step 1: XPath for the drop-down triangle in popular times
nm1<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[1]/div[1]/div/div/div[2]"
dropdown_value <- web_driver$findElements("xpath", value = nm1)
dropdown_value[[1]]$clickElement()
# Once we click it, the drop-down lists all the days
# Step 2: XPath for Monday
nm_monday<-"/html/body/div[6]/div[1]/div"
dropdown_click <- web_driver$findElements("xpath", value = nm_monday)
dropdown_click[[1]]$clickElement()
# After running the above two sections, Monday appears in the day drop-down
# Scroll down a little more to make the busy time graph more visible
# CSS selector for the scrollable left panel
scrl_nm <- "div.bJzME:nth-child(2) > div:nth-child(1) > div:nth-child(1)"
scrollable_div <- try(web_driver$findElement(using = "css", value = scrl_nm))
for (i in 1) {
  scrollable_div$sendKeysToElement(sendKeys = list(key = "page_down"))
  Sys.sleep(1)
}
Now let's extract all the “busy at a certain time” information from the graph
Extracting the graph for Monday
The graph starts at 6 AM and ends at 11 PM. Even though there are no values from 6 AM until about 9:30 AM and after 9 PM, we still extract whatever is in the elements. These elements can be inspected as shown below.
Let's look at the XPaths for a few time periods to find a general pattern
# XPaths for 6 AM, 7 AM and 8 AM
xml_6AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[2]/div/div[1]"
xml_7AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[2]/div/div[2]"
xml_8AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[2]/div/div[3]"
# The above 3 XPaths are the same except for the index of the last div.
# The common XPath, which matches every bar, is
nm_common<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[2]/div/div"
Extracting the individual components. Note that the text is stored in the aria-label attribute, so we use getElementAttribute.
# Read the aria-label of every bar with getElementAttribute
timing_xml <- web_driver$findElements(using = "xpath", value = nm_common)
ls_Monday <- list()
# seq_along avoids a bogus 1:0 loop when no bars are found
for (i in seq_along(timing_xml)) {
  # Getting the busy details
  busy_text <- try(timing_xml[[i]]$getElementAttribute("aria-label")[[1]])
  print(busy_text)
  ls_Monday[i] <- busy_text
}
## [1] "0% busy at 6 am."
## [1] "0% busy at 7 am."
## [1] "0% busy at 8 am."
## [1] "0% busy at 9 am."
## [1] "4% busy at 10 am."
## [1] "10% busy at 11 am."
## [1] "25% busy at 12 pm."
## [1] "28% busy at 1 pm."
## [1] "33% busy at 2 pm."
## [1] "32% busy at 3 pm."
## [1] "22% busy at 4 pm."
## [1] "19% busy at 5 pm."
## [1] "24% busy at 6 pm."
## [1] "26% busy at 7 pm."
## [1] "31% busy at 8 pm."
## [1] "0% busy at 9 pm."
## [1] "0% busy at 10 pm."
## [1] "0% busy at 11 pm."
ls_Monday[1]
## [[1]]
## [1] "0% busy at 6 am."
Monday_timing=as.character(ls_Monday)
monday_df<-data.frame(Day="Monday",
Busy_Details=Monday_timing)
monday_df
## Day Busy_Details
## 1 Monday 0% busy at 6 am.
## 2 Monday 0% busy at 7 am.
## 3 Monday 0% busy at 8 am.
## 4 Monday 0% busy at 9 am.
## 5 Monday 4% busy at 10 am.
## 6 Monday 10% busy at 11 am.
## 7 Monday 25% busy at 12 pm.
## 8 Monday 28% busy at 1 pm.
## 9 Monday 33% busy at 2 pm.
## 10 Monday 32% busy at 3 pm.
## 11 Monday 22% busy at 4 pm.
## 12 Monday 19% busy at 5 pm.
## 13 Monday 24% busy at 6 pm.
## 14 Monday 26% busy at 7 pm.
## 15 Monday 31% busy at 8 pm.
## 16 Monday 0% busy at 9 pm.
## 17 Monday 0% busy at 10 pm.
## 18 Monday 0% busy at 11 pm.
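The Busy_Details strings are easier to analyse once the percentage and the hour are split into numeric columns. The sketch below is an optional extra and not part of the original code: parse_busy_text and the column names Busy_Pct and Hour_24 are hypothetical, and it assumes every label follows the "N% busy at H am." or "N% busy at H pm." pattern seen in the output above. Labels that do not match give NA.

```r
# Hypothetical helper: split labels such as "25% busy at 12 pm." into numbers
parse_busy_text <- function(busy_text) {
  pattern <- "^([0-9]+)% busy at ([0-9]+) (am|pm)\\.?$"
  ok <- grepl(pattern, busy_text)

  pct <- rep(NA_integer_, length(busy_text))
  hour12 <- rep(NA_integer_, length(busy_text))
  meridiem <- rep(NA_character_, length(busy_text))

  pct[ok] <- as.integer(sub(pattern, "\\1", busy_text[ok]))
  hour12[ok] <- as.integer(sub(pattern, "\\2", busy_text[ok]))
  meridiem[ok] <- sub(pattern, "\\3", busy_text[ok])

  # Convert to a 24-hour clock: 12 am becomes 0, 12 pm stays 12, 1 pm becomes 13
  hour24 <- hour12 %% 12L + ifelse(meridiem == "pm", 12L, 0L)

  data.frame(Busy_Pct = pct, Hour_24 = hour24, stringsAsFactors = FALSE)
}

parsed <- parse_busy_text(c("0% busy at 6 am.", "25% busy at 12 pm.", "28% busy at 1 pm."))
print(parsed)
```

The result can be attached to the scraped data with cbind(monday_df, parse_busy_text(monday_df$Busy_Details)).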
Let's now extract the same information for Wednesday
Extracting information from the popular times section for Wednesday
If we scroll down, we see a histogram-like chart as shown below.
We have to extract the height of the bars for different days. For this we use the drop-down and select the required day. For our example, let's say we want to check how busy the place is on Wednesday.
# Scroll up a little to make the drop-down triangle visible
for(i in 1 ){
scrollable_div$sendKeysToElement(sendKeys = list(key = "page_up"))
Sys.sleep(1)
}
# Run the entire section in one go
# XPath for the drop-down
nm1<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[1]/div[1]/div/div/div[2]"
dropdown_value <- web_driver$findElements("xpath", value = nm1)
dropdown_value[[1]]$clickElement()
# Once we click it, the drop-down lists all the days
# XPath for Wednesday
nm_wednesday<-"/html/body/div[6]/div[3]/div"
dropdown_click <- web_driver$findElements("xpath", value = nm_wednesday)
dropdown_click[[1]]$clickElement()
# After running the above two sections, Wednesday appears in the day drop-down
# Scroll down a little to make the chart visible
for(i in 1 ){
scrollable_div$sendKeysToElement(sendKeys = list(key = "page_down"))
Sys.sleep(1)
# try(web_driver$executeScript("arguments[0].scrollTop = arguments[0].scrollHeight",
# scrollable_div))
}
Now let's extract all the “busy at a certain time” information from the graph
Extracting the graph for Wednesday
The graph starts at 6 AM and ends at 11 PM. Even though there are no values from 6 AM until about 9:30 AM and after 9 PM, we still extract whatever is in the elements.
# XPaths for 6 AM, 7 AM and 8 AM
xml_6AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[4]/div/div[1]"
xml_7AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[4]/div/div[2]"
xml_8AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[4]/div/div[3]"
# The common XPath is
nm_common<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[4]/div/div"
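Comparing the Monday and Wednesday code, only two indices change: the drop-down entry (div[1] for Monday, div[3] for Wednesday) and the graph container (div[2] for Monday, div[4] for Wednesday). The sketch below builds both XPaths from those indices. It is an illustration only: day_xpaths and panel_xpath are hypothetical names, and the indices for other days are not verified here, so they are left as arguments rather than derived.

```r
# Shared prefix of the popular times XPaths used in this post
panel_xpath <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]"

# Hypothetical helper: build the drop-down and bar XPaths for one day
day_xpaths <- function(dropdown_index, graph_index, panel = panel_xpath) {
  list(
    dropdown_option = sprintf("/html/body/div[6]/div[%d]/div", dropdown_index),
    bars = sprintf("%s/div[3]/div[%d]/div/div", panel, graph_index)
  )
}

# Index pairs observed in this post
monday <- day_xpaths(dropdown_index = 1, graph_index = 2)
wednesday <- day_xpaths(dropdown_index = 3, graph_index = 4)
print(wednesday$bars)
```

These absolute XPaths depend on the page layout at the time of scraping, so they should be re-checked with Inspect before reuse.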
Extracting the individual components
timing_xml <- web_driver$findElements(using = "xpath", value = nm_common)
# Read the aria-label of every bar with getElementAttribute
ls_wednesday <- list()
# seq_along avoids a bogus 1:0 loop when no bars are found
for (i in seq_along(timing_xml)) {
  # Getting the busy details
  busy_text <- try(timing_xml[[i]]$getElementAttribute("aria-label")[[1]])
  print(busy_text)
  ls_wednesday[i] <- busy_text
}
## [1] "0% busy at 6 am."
## [1] "0% busy at 7 am."
## [1] "0% busy at 8 am."
## [1] "0% busy at 9 am."
## [1] "36% busy at 10 am."
## [1] "55% busy at 11 am."
## [1] "70% busy at 12 pm."
## [1] "55% busy at 1 pm."
## [1] "33% busy at 2 pm."
## [1] "12% busy at 3 pm."
## [1] "9% busy at 4 pm."
## [1] "16% busy at 5 pm."
## [1] "35% busy at 6 pm."
## [1] "57% busy at 7 pm."
## [1] "54% busy at 8 pm."
## [1] "0% busy at 9 pm."
## [1] "0% busy at 10 pm."
## [1] "0% busy at 11 pm."
wednesday_timing=as.character(ls_wednesday)
wednesday_df<-data.frame(Day="Wednesday",
Busy_Details=wednesday_timing)
wednesday_df
## Day Busy_Details
## 1 Wednesday 0% busy at 6 am.
## 2 Wednesday 0% busy at 7 am.
## 3 Wednesday 0% busy at 8 am.
## 4 Wednesday 0% busy at 9 am.
## 5 Wednesday 36% busy at 10 am.
## 6 Wednesday 55% busy at 11 am.
## 7 Wednesday 70% busy at 12 pm.
## 8 Wednesday 55% busy at 1 pm.
## 9 Wednesday 33% busy at 2 pm.
## 10 Wednesday 12% busy at 3 pm.
## 11 Wednesday 9% busy at 4 pm.
## 12 Wednesday 16% busy at 5 pm.
## 13 Wednesday 35% busy at 6 pm.
## 14 Wednesday 57% busy at 7 pm.
## 15 Wednesday 54% busy at 8 pm.
## 16 Wednesday 0% busy at 9 pm.
## 17 Wednesday 0% busy at 10 pm.
## 18 Wednesday 0% busy at 11 pm.
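Since both data frames share the Day and Busy_Details columns, they can be stacked into a single table with rbind. The sketch below uses two-row stand-ins copied from the output above so that it runs without a browser session. The name busy_df is hypothetical.

```r
# Stand-in data: two rows per day, copied from the scraped output above
monday_df <- data.frame(Day = "Monday",
                        Busy_Details = c("25% busy at 12 pm.", "28% busy at 1 pm."))
wednesday_df <- data.frame(Day = "Wednesday",
                           Busy_Details = c("70% busy at 12 pm.", "55% busy at 1 pm."))

# Stack the per-day tables; the column names must match for rbind to work
busy_df <- rbind(monday_df, wednesday_df)
print(busy_df)
```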