Web Scraping using RSelenium
Parag Verma and Wen Long
2023-12-13
Basics of Web Scraping
Web Scraping is the process of extracting useful information from a website or URL. This information can be in the form of text, tables, embedded links, ratings, etc. It is a very handy tool when one wants to supplement existing information on country demographics, customer preferences, store locations, etc.
How to scrape data from a website
Data can be scraped in two ways:
- From a static website (one that doesn't change often). Examples include Wikipedia pages, government websites, and company e-sites.
- From a dynamic website (such as Google pages, Shopee, etc.). Here the content is rendered through JavaScript and jQuery, so we can't use the plain HTML approach (targeting specific tags) to extract information.
In this blog, we will look at how to scrape data for the second option. We will use what is known as headless browsing. A headless browser lets you load a website without a GUI, with all actions driven through a command line interface. I won't go into the details of it, as that would muddy the purpose of this blog. In R, we use the RSelenium package for headless browsing.
Step 0: Importing the libraries
# Install (if missing) and load the required packages
package.name <- c("tidyverse", "RSelenium")
for(i in package.name){
  if(!require(i, character.only = TRUE)){
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: RSelenium
Step 1: Extracting the names and addresses of all Jumbo Vada Pav stores in Mulund West, Mumbai
Let's say we want to extract the names of all the stores and their locations in Mulund West. Let's see what we get when we search for this on Google.
We can see from the image that there are two Jumbo Vada Pav stores in Mulund West, Mumbai. I have taken this example because the concept is easy to grasp with just two stores.
Step 2: How would RSelenium do web scraping for these two stores
Rselenium would perform the following basic steps:
- Start a headless browser
- Navigate to the Google Maps page (shown above)
- Get the URL (link) of each of these stores
- Navigate to each of these links
- Get the XPath for the store name and address
- For each of these XPaths (name and address), get the element sitting at that location
So, as the first step, we will start a headless browser. Firefox works fine on my system, so I will go with the Firefox browser.
Step 3: Start a headless Firefox browser
The syntax for initiating a headless Firefox browser is shown below.
driver <- rsDriver(
  browser = c("firefox"),
  chromever = NULL,
  verbose = FALSE,
  extraCapabilities = list("firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]
Once I execute this, a Firefox browser will pop up in the background as shown below.
Step 4: Navigate to the Google Maps page for Jumbo Vada Pav
We will now use the Firefox browser to navigate to the Google Maps page for Jumbo Vada Pav.
nm<-"Jumbo wada pav mulund west "
ad_url<-str_c("https://www.google.co.id/maps/search/ ",nm)
# Now navigate to the URL.This is for the browser to go to that location
web_driver$navigate(ad_url)
Once I execute the above, the Firefox browser will go to the Jumbo Vada Pav page.
Step 5: Get the URL (link) of each of these stores
In order to get the link, we have to right-click on the first store and click on Inspect.
Once you click on Inspect, you will be directed to the highlighted portion. The a tag indicates that it is a link.
Right-click on the highlighted portion and copy the XPath as shown below.
The XPath would look something like this:
/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[4]/div/a
Similarly, if I follow the above steps and get the XPath for the second store, it would look something like this:
/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[6]/div/a
The only difference between the two is the penultimate div element: for the first store it is div[4] and for the second store it is div[6].
Now we will use this information to extract all the links. For each of these XPaths, we need to get the href (URL).
The penultimate div, the only element that differs between the two stores' XPaths, is specified simply as div (instead of div[4] or div[6]) so that the expression matches both stores. For each of the matched elements, we then extract the href using the getElementAttribute() function.
# Find all store link elements using the generalized XPath (div instead of div[4]/div[6])
link_Store <- web_driver$findElements(using = "xpath", value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div/div/a")

# Extract the href attribute (the store URL) from each matched link element
l1 <- list()
for(i in seq_along(link_Store)){
  l1[[i]] <- link_Store[[i]]$getElementAttribute("href")[[1]]
}
l1
## [[1]]
## [1] "https://www.google.co.id/maps/place/JUMBO+KING+MULUND/data=!4m7!3m6!1s0x3be7b9f243892907:0xc1c58fde55a52ab8!8m2!3d19.1716851!4d72.9552551!16s%2Fg%2F11c6lk8nbl!19sChIJBymJQ_K55zsRuCqlVd6PxcE?authuser=0&hl=en&rclk=1"
##
## [[2]]
## [1] "https://www.google.co.id/maps/place/Jumbo+Vada+Pav/data=!4m7!3m6!1s0x3be7b9ee2763f83f:0x6a56910364c6346b!8m2!3d19.1722852!4d72.9559067!16s%2Fg%2F11t10_mhq2!19sChIJP_hjJ-655zsRazTGZAORVmo?authuser=0&hl=en&rclk=1"
We can see that the URLs for the two stores are now stored in the l1 list. We will use these links to navigate to each individual store page and then extract the store name and address.
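A minimal sketch of this step is shown below: it visits each link in l1 and fills two lists, l2 (store names) and l3 (store addresses), that feed into Step 6. The XPaths used here are assumptions about the Google Maps page layout (the h1 heading for the store name and the element whose data-item-id attribute is "address" for the address); in practice, confirm them by inspecting the store page, just as we did for the links.
# A sketch of the extraction step (the XPaths below are assumptions; verify by inspecting the page)
l2 <- list()   # store names
l3 <- list()   # store addresses

for(i in seq_along(l1)){

  # Go to the individual store page
  web_driver$navigate(l1[[i]])
  Sys.sleep(5)   # give the page time to load

  # Assumed location of the store name: the main h1 heading of the place page
  name_elem <- web_driver$findElement(using = "xpath", value = "//h1")
  l2[[i]] <- name_elem$getElementText()[[1]]

  # Assumed location of the address: the element whose data-item-id is 'address'
  addr_elem <- web_driver$findElement(using = "xpath",
                                      value = "//*[@data-item-id='address']")
  l3[[i]] <- addr_elem$getElementText()[[1]]

}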
Step 6: Creating the final data frame
final.df <- data.frame(Store_Name = as.character(l2),
                       Store_Address = as.character(l3))
final.df
## Store_Name
## 1 JUMBO KING MULUND
## 2 Jumbo Vada Pav
## Store_Address
## 1 Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
## 2 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
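Once the data frame is ready, it is good practice to close the browser and stop the Selenium server so that the background Firefox process does not keep running:
# Close the browser session and stop the Selenium server
web_driver$close()
driver$server$stop()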