Popular Times

In this post we will scrape the "busy" text from the Popular times section in Google Maps.

Step 0: Importing the libraries

library(RSelenium)  # rsDriver() and the browser client
library(stringr)    # str_c()

Step 1: Start a headless Firefox browser

driver <- rsDriver(
    browser = "firefox",
    chromever = NULL,
    verbose = FALSE,
    # geckodriver reads Firefox arguments from the "moz:firefoxOptions" capability
    extraCapabilities = list("moz:firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]

# Search Google Maps for Cedele restaurants
nm <- "cedele restaurant"
# URLencode() turns the space in the query into %20
ad_url <- str_c("https://www.google.co.id/maps/search/", URLencode(nm))

web_driver$navigate(ad_url)

The page looks like the image below.

Step 2: Get the URL (link) of one of these restaurants to start with

To get the link, right-click on the first store and click Inspect.

The link to the restaurant is in an <a> tag.

Right-click on this element in the inspector and copy its XPath.

# The XPath of the link
nm1<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[3]/div/a"
nm1

## [1] "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[3]/div/a"

Using the XPath to access the link

link_restuarants <- web_driver$findElements(using = "xpath", value = nm1)
rest_url<-link_restuarants[[1]]$getElementAttribute("href")[[1]]
rest_url

## [1] "https://www.google.co.id/maps/place/Cedele+Bakery+Kitchen+-+The+Woodleigh+Mall/data=!4m7!3m6!1s0x31da1793c89df043:0xf72df23d7aafbfac!8m2!3d1.3379161!4d103.8723492!16s%2Fg%2F11v05s7v9f!19sChIJQ_CdyJMX2jERrL-vej3yLfc?authuser=0&hl=en&rclk=1"

Navigating to the URL

web_driver$navigate(rest_url)

Step 3: Scrolling down to the Popular times section (most important)

Scrolling in Google Maps, reviews and Popular times works very differently from most websites. On most websites, a single scroll-down command scrolls the whole page. In Google reviews or Google Maps, however, the page is effectively split into two panes, and you have to scroll up or down within the left pane. This is shown below.

So we need a workaround, in two steps:

Find the CSS selector of the scrollable element in the left pane
Send the Page Down or Page Up key to that element to scroll as needed

3A: Getting the CSS selector of the scrollable element

The CSS selector comes out as “div.bJzME:nth-child(2) > div:nth-child(1) > div:nth-child(1)”. In the browser inspector, copy the CSS selector, not the CSS path.

Once we have the CSS selector, we can use findElement (not findElements) to get the scrollable element and send scroll keys to it.

The same approach scrolls up as well as down. Sometimes we need to scroll up and sometimes down within the same data extraction step. We will see this in the Monday busy-time extraction.
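Because the same key-press step is repeated in both directions, it can be wrapped in a small helper. The sketch below is only an illustration: scroll_panel and its arguments are hypothetical names, not part of RSelenium. It assumes the element passed in exposes sendKeysToElement(), as the element returned by findElement does.

```r
# Hypothetical helper: send Page Down / Page Up to a scrollable element.
# `element` is assumed to expose sendKeysToElement(), like an RSelenium web element.
scroll_panel <- function(element, direction = c("down", "up"), times = 1, pause = 1) {
  direction <- match.arg(direction)
  key <- if (direction == "down") "page_down" else "page_up"
  for (i in seq_len(times)) {
    element$sendKeysToElement(sendKeys = list(key = key))
    Sys.sleep(pause)  # give the pane time to render before the next key press
  }
  invisible(element)
}
```

With a live session this would be called as scroll_panel(scrollable_div, "down") or scroll_panel(scrollable_div, "up"), where scrollable_div is the element found with the CSS selector.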

Step 4: Extracting information from the Popular times section for Monday

If we scroll down, we can see a histogram-like chart, as shown below.

We have to extract the height of the bars for different days. To do this, we use the drop-down to select the required day. For our example, let's say we want to check how busy the place is on Monday.

Step 1 here is to open the drop-down menu so that the different days become visible, and then click on Monday to get the occupancy details. We can right-click on the downward triangle of the drop-down to get its XPath, as shown below.

Step 2 is to get the XPath for the Monday text, as shown below.

Scrolling down to reach the Popular times section

# CSS selector of the scrollable left pane
scrl_nm <- "div.bJzME:nth-child(2) > div:nth-child(1) > div:nth-child(1)"
scrollable_div <-
  try(web_driver$findElement(using = "css selector",
                              value = scrl_nm))

# One Page Down takes us to the Popular times section, where the drop-down triangle is just visible
for (i in 1) {
  scrollable_div$sendKeysToElement(sendKeys = list(key = "page_down"))
  Sys.sleep(1)
}


# Step 1: XPath for the drop-down triangle in Popular times
nm1 <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[1]/div[1]/div/div/div[2]"

dropdown_value <- web_driver$findElements("xpath", value = nm1)
dropdown_value[[1]]$clickElement()
# Once this runs, the drop-down lists all the days


# Step 2: XPath for Monday
nm_monday <- "/html/body/div[6]/div[1]/div"
dropdown_click <- web_driver$findElements("xpath", value = nm_monday)
dropdown_click[[1]]$clickElement()
# After running the two sections above, Monday appears in the day drop-down

# Scroll down a little more to make the busy-time graph fully visible
# Get the scrollable element again using its CSS selector
scrl_nm <- "div.bJzME:nth-child(2) > div:nth-child(1) > div:nth-child(1)"
scrollable_div <-
  try(web_driver$findElement(using = "css selector",
                              value = scrl_nm))

for (i in 1) {
  scrollable_div$sendKeysToElement(sendKeys = list(key = "page_down"))
  Sys.sleep(1)
}

Now let's extract all the “busy at a certain time” information from the graph.

Extracting the graph for Monday

The graph starts at 6 AM and ends at 11 PM. Even though there are no values from 6 AM until about 9:30 AM, or after 9 PM, we still extract whatever is in the elements. These elements can be inspected as shown below.

Let's look at the XPath for a few of the time periods to find a general pattern.

# XPaths for 6 AM, 7 AM and 8 AM
xml_6AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[2]/div/div[1]"

xml_7AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[2]/div/div[2]"

xml_8AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[2]/div/div[3]"

# The three XPaths above are identical except for the index on the last div.

# Dropping that index gives a common XPath that matches every bar
nm_common<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[2]/div/div"

Extracting the individual components. Note that the text is stored in the aria-label attribute of each bar, so we use getElementAttribute.

# Find every bar, then read its aria-label with getElementAttribute
timing_xml <- web_driver$findElements(using = "xpath", value = nm_common)


ls_Monday <- list()
for (i in seq_along(timing_xml)) {

  # Getting the busy details
  busy_text <- try(timing_xml[[i]]$getElementAttribute("aria-label")[[1]])
  print(busy_text)
  ls_Monday[[i]] <- busy_text
}

## [1] "0% busy at 6 am."
## [1] "0% busy at 7 am."
## [1] "0% busy at 8 am."
## [1] "0% busy at 9 am."
## [1] "4% busy at 10 am."
## [1] "10% busy at 11 am."
## [1] "25% busy at 12 pm."
## [1] "28% busy at 1 pm."
## [1] "33% busy at 2 pm."
## [1] "32% busy at 3 pm."
## [1] "22% busy at 4 pm."
## [1] "19% busy at 5 pm."
## [1] "24% busy at 6 pm."
## [1] "26% busy at 7 pm."
## [1] "31% busy at 8 pm."
## [1] "0% busy at 9 pm."
## [1] "0% busy at 10 pm."
## [1] "0% busy at 11 pm."

ls_Monday[1]

## [[1]]
## [1] "0% busy at 6 am."
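One thing to keep in mind is that try() stores an error object in the list when a lookup fails. A possible variation, sketched below, returns NA instead so the result is always a plain character vector. The name get_label is hypothetical, and the sketch assumes each element exposes getElementAttribute() returning a list, as in the loop above.

```r
# Hypothetical helper: read an attribute from an element, returning NA on failure.
# Failure covers both an error from the driver and an empty result (no such attribute).
get_label <- function(element, attr_name = "aria-label") {
  tryCatch(
    element$getElementAttribute(attr_name)[[1]],
    error = function(e) NA_character_
  )
}

# With a live session (not run here):
# labels <- vapply(timing_xml, get_label, character(1))
```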

Monday_timing=as.character(ls_Monday)


monday_df<-data.frame(Day="Monday",
                      Busy_Details=Monday_timing)

monday_df

##       Day       Busy_Details
## 1  Monday   0% busy at 6 am.
## 2  Monday   0% busy at 7 am.
## 3  Monday   0% busy at 8 am.
## 4  Monday   0% busy at 9 am.
## 5  Monday  4% busy at 10 am.
## 6  Monday 10% busy at 11 am.
## 7  Monday 25% busy at 12 pm.
## 8  Monday  28% busy at 1 pm.
## 9  Monday  33% busy at 2 pm.
## 10 Monday  32% busy at 3 pm.
## 11 Monday  22% busy at 4 pm.
## 12 Monday  19% busy at 5 pm.
## 13 Monday  24% busy at 6 pm.
## 14 Monday  26% busy at 7 pm.
## 15 Monday  31% busy at 8 pm.
## 16 Monday   0% busy at 9 pm.
## 17 Monday  0% busy at 10 pm.
## 18 Monday  0% busy at 11 pm.
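The Busy_Details column is still free text. If numeric values are needed, the labels can be split into a percentage and an hour. The sketch below assumes every label follows the "N% busy at H am." or "N% busy at H pm." format seen in the output above; any other wording is returned as NA. The function name parse_busy and the output column names are illustrative choices.

```r
# Hypothetical parser for labels such as "25% busy at 12 pm."
parse_busy <- function(x) {
  pattern <- "^([0-9]+)% busy at ([0-9]+) (am|pm)\\.$"
  matched <- grepl(pattern, x)

  pct <- rep(NA_integer_, length(x))
  hour12 <- rep(NA_integer_, length(x))
  meridiem <- rep(NA_character_, length(x))

  # Only convert the labels that match, so unexpected text stays NA
  pct[matched] <- as.integer(sub(pattern, "\\1", x[matched]))
  hour12[matched] <- as.integer(sub(pattern, "\\2", x[matched]))
  meridiem[matched] <- sub(pattern, "\\3", x[matched])

  # 24-hour clock: 12 am -> 0, 12 pm -> 12, 1 pm -> 13
  hour24 <- hour12 %% 12L + ifelse(meridiem == "pm", 12L, 0L)

  data.frame(Busy_Pct = pct, Hour = hour24)
}

parse_busy(c("0% busy at 6 am.", "25% busy at 12 pm.", "31% busy at 8 pm."))
```

In the session above this could be applied as parse_busy(monday_df$Busy_Details) and bound to the Day column with cbind.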

Let's now extract the same data for Wednesday.

Extracting information from the Popular times section for Wednesday

If we scroll down, we can see the same histogram-like chart, as shown below.

# Scroll up a little to make the drop-down triangle visible
for (i in 1) {
  scrollable_div$sendKeysToElement(sendKeys = list(key = "page_up"))
  Sys.sleep(1)
}

# Run the entire section in one go

# XPath for the drop-down
nm1 <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[1]/div[1]/div/div/div[2]"

dropdown_value <- web_driver$findElements("xpath", value = nm1)
dropdown_value[[1]]$clickElement()
# Once this runs, the drop-down lists all the days


# XPath for Wednesday
nm_wednesday <- "/html/body/div[6]/div[3]/div"
dropdown_click <- web_driver$findElements("xpath", value = nm_wednesday)
dropdown_click[[1]]$clickElement()
# After running the two sections above, Wednesday appears in the day drop-down


# Scroll down a little to make the chart visible
for (i in 1) {
  scrollable_div$sendKeysToElement(sendKeys = list(key = "page_down"))
  Sys.sleep(1)

  # Alternative (not used here): jump to the bottom of the pane with JavaScript
  # try(web_driver$executeScript("arguments[0].scrollTop = arguments[0].scrollHeight",
  #                                list(scrollable_div)))
}

Now let's extract all the “busy at a certain time” information from the graph.

Extracting the graph for Wednesday

The graph starts at 6 AM and ends at 11 PM. Even though there are no values from 6 AM until about 9:30 AM, or after 9 PM, we still extract whatever is in the elements.

# XPaths for 6 AM, 7 AM and 8 AM (note div[4] for Wednesday, where Monday had div[2])
xml_6AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[4]/div/div[1]"

xml_7AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[4]/div/div[2]"

xml_8AM<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[4]/div/div[3]"

# The common XPath, without the index on the last div, is
nm_common<-"/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]/div[3]/div[4]/div/div"
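Comparing Monday and Wednesday, only two indices change: the position of the day in the drop-down (div[1] vs div[3]) and the position of the day's chart (div[2] vs div[4]). A small helper can build both XPaths from those two numbers. This is a sketch: day_xpaths and panel_root are hypothetical names, and only the Monday and Wednesday indices are confirmed here. Indices for other days would need to be checked in the inspector, and the absolute paths will break whenever Google changes the page layout.

```r
# Shared prefix of the Popular times panel, copied from the XPaths in this post
panel_root <- "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[14]"

# Hypothetical helper: build the XPaths for one day from its two indices
day_xpaths <- function(dropdown_idx, chart_idx) {
  list(
    # Entry for the day inside the opened drop-down menu
    dropdown_item = sprintf("/html/body/div[6]/div[%d]/div", dropdown_idx),
    # All hourly bars of that day's chart
    bars = sprintf("%s/div[3]/div[%d]/div/div", panel_root, chart_idx)
  )
}

monday_paths <- day_xpaths(dropdown_idx = 1, chart_idx = 2)
wednesday_paths <- day_xpaths(dropdown_idx = 3, chart_idx = 4)
```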

Extracting the individual components

timing_xml <- web_driver$findElements(using = "xpath", value = nm_common)


# Read the aria-label of each bar with getElementAttribute

ls_wednesday <- list()
for (i in seq_along(timing_xml)) {

  # Getting the busy details
  busy_text <- try(timing_xml[[i]]$getElementAttribute("aria-label")[[1]])
  print(busy_text)
  ls_wednesday[[i]] <- busy_text
}

## [1] "0% busy at 6 am."
## [1] "0% busy at 7 am."
## [1] "0% busy at 8 am."
## [1] "0% busy at 9 am."
## [1] "36% busy at 10 am."
## [1] "55% busy at 11 am."
## [1] "70% busy at 12 pm."
## [1] "55% busy at 1 pm."
## [1] "33% busy at 2 pm."
## [1] "12% busy at 3 pm."
## [1] "9% busy at 4 pm."
## [1] "16% busy at 5 pm."
## [1] "35% busy at 6 pm."
## [1] "57% busy at 7 pm."
## [1] "54% busy at 8 pm."
## [1] "0% busy at 9 pm."
## [1] "0% busy at 10 pm."
## [1] "0% busy at 11 pm."

wednesday_timing=as.character(ls_wednesday)


wednesday_df<-data.frame(Day="Wednesday",
                      Busy_Details=wednesday_timing)

wednesday_df

##          Day       Busy_Details
## 1  Wednesday   0% busy at 6 am.
## 2  Wednesday   0% busy at 7 am.
## 3  Wednesday   0% busy at 8 am.
## 4  Wednesday   0% busy at 9 am.
## 5  Wednesday 36% busy at 10 am.
## 6  Wednesday 55% busy at 11 am.
## 7  Wednesday 70% busy at 12 pm.
## 8  Wednesday  55% busy at 1 pm.
## 9  Wednesday  33% busy at 2 pm.
## 10 Wednesday  12% busy at 3 pm.
## 11 Wednesday   9% busy at 4 pm.
## 12 Wednesday  16% busy at 5 pm.
## 13 Wednesday  35% busy at 6 pm.
## 14 Wednesday  57% busy at 7 pm.
## 15 Wednesday  54% busy at 8 pm.
## 16 Wednesday   0% busy at 9 pm.
## 17 Wednesday  0% busy at 10 pm.
## 18 Wednesday  0% busy at 11 pm.
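Since monday_df and wednesday_df share the same two columns, they can be stacked into one table with rbind. The sketch below uses two small stand-in data frames (monday_demo and wednesday_demo, hypothetical names with a few rows copied from the output above) so that it runs without a browser session.

```r
# Stand-in frames; in the real session these would be monday_df and wednesday_df
monday_demo <- data.frame(Day = "Monday",
                          Busy_Details = c("25% busy at 12 pm.", "28% busy at 1 pm."))
wednesday_demo <- data.frame(Day = "Wednesday",
                             Busy_Details = c("70% busy at 12 pm.", "55% busy at 1 pm."))

# do.call(rbind, ...) stacks any number of per-day frames
day_frames <- list(monday_demo, wednesday_demo)
popular_times_df <- do.call(rbind, day_frames)
popular_times_df
```

Keeping the per-day frames in a list means more days can be added without changing the combining step.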

Machine Learning Made Easy

Saturday, October 26, 2024

Web Scraping Tutorial 4- Getting the busy information data from Popular time page from Google