Web Scraping using RSelenium
Parag Verma and Wen Long
2023-12-13
Basics of Web Scraping
Web Scraping is the process of extracting useful information from a website or URL. This information can be in the form of text, tables, embedded links, ratings, etc. It is a very handy tool when one wants to supplement existing information on country demographics, customer preferences, store locations, etc.
How to scrape data from a website
Data can be scraped in two ways:
- From a static website (one that doesn't change often). Examples include Wikipedia pages, government websites, and company e-sites.
- From a dynamic website (such as Google pages, Shopee, etc.). Here the content is rendered through JavaScript and jQuery, so we can't use the plain HTML approach (targeting specific tags) to extract information.
In this blog, we will look at how to scrape data for the second option. We will use what is known as headless browsing. A headless browser lets you load a website without a GUI, with all actions driven through a command line interface. I won't go into the details of it, as that would muddy the purpose of this blog. In R, we use the RSelenium package for headless browsing.
Step 0: Importing the libraries
# Install (if missing) and load the required packages
package.name <- c("tidyverse", "RSelenium")
for(i in package.name){
  if(!require(i, character.only = TRUE)){
    install.packages(i)
  }
  library(i, character.only = TRUE)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: RSelenium
Step 1: Extracting the names and addresses of all Jumbo Vada Pav stores in Mulund West, Mumbai
Let's say we want to extract the names of all the stores and their locations in Mulund West. Let's see what we get when we search for this on Google.
We can see from the image that there are two Jumbo Vada Pav stores in Mulund West, Mumbai. I have taken this example because the concept is easy to grasp with just two stores.
Step 2: How would RSelenium do web scraping for these two stores
Rselenium would perform the following basic steps:
- Start a headless browser
- Navigate to the Google Maps page (shown above)
- Get the URL (link) of each of these stores
- Navigate to each of these links
- Get the XPath for the store name and address
- For each of these XPaths (name and address), get the element sitting at that location
So, as the first step, we will start a headless browser. Firefox works fine on my system, so I will go with the Firefox browser.
Step 3: Start a headless Firefox browser
The syntax for initiating a headless Firefox browser is shown below.
driver <- rsDriver(
  browser = c("firefox"),
  chromever = NULL,
  verbose = FALSE,
  extraCapabilities = list("firefoxOptions" = list(args = list("--headless")))
)
web_driver <- driver[["client"]]
Once I execute this, a Firefox browser will pop up in the background as shown below.
Step 4: Navigate to the Google Maps page for Jumbo Vada Pav
We will now use the Firefox browser to navigate to the Google Maps page for Jumbo Vada Pav.
nm<-"Jumbo wada pav mulund west "
ad_url<-str_c("https://www.google.co.id/maps/search/ ",nm)
# Now navigate to the URL.This is for the browser to go to that location
web_driver$navigate(ad_url)
Once I execute the above, the Firefox browser will go to the Jumbo Vada Pav page.
Step 5: Get the URL (link) of each of these stores
In order to get the link, we have to right-click on the first store and click on Inspect.
Once you click on Inspect, you will be directed to the highlighted portion. The a tag indicates that it is a link.
Right-click on the highlighted portion and copy the XPath as shown below.
The XPath would look something like this:
/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[4]/div/a
Similarly, if I follow the above steps and get the XPath for the second store, it would look something like this:
/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div[6]/div/a
The only difference between the two is the penultimate div element: for the first store it is div[4] and for the second store it is div[6].
Now we will use this information to extract all the links. For each of these XPaths, we need to get the href (URL).
The penultimate div, the only element that differs between the two stores' XPaths, is specified simply as div (instead of div[4] or div[6]) so that the expression matches both stores. For each of the matched elements, we then extract the href using the getElementAttribute() function.
# Find all store link elements using the generalized XPath (div instead of div[4]/div[6])
link_Store <- web_driver$findElements(using = "xpath", value = "/html/body/div[2]/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[1]/div[1]/div/div/a")

# Extract the href attribute (the store URL) from each matched link element
l1 <- list()
for(i in seq_along(link_Store)){
  l1[[i]] <- link_Store[[i]]$getElementAttribute("href")[[1]]
}
l1
## [[1]]
## [1] "https://www.google.co.id/maps/place/JUMBO+KING+MULUND/data=!4m7!3m6!1s0x3be7b9f243892907:0xc1c58fde55a52ab8!8m2!3d19.1716851!4d72.9552551!16s%2Fg%2F11c6lk8nbl!19sChIJBymJQ_K55zsRuCqlVd6PxcE?authuser=0&hl=en&rclk=1"
##
## [[2]]
## [1] "https://www.google.co.id/maps/place/Jumbo+Vada+Pav/data=!4m7!3m6!1s0x3be7b9ee2763f83f:0x6a56910364c6346b!8m2!3d19.1722852!4d72.9559067!16s%2Fg%2F11t10_mhq2!19sChIJP_hjJ-655zsRazTGZAORVmo?authuser=0&hl=en&rclk=1"
We can see that the URLs for the two stores are now stored in the l1 list. We will use these links to navigate to each individual store page and then extract the store name and address.
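A minimal sketch of this step is shown below: it visits each link in l1 and fills two lists, l2 (store names) and l3 (store addresses), that feed into Step 6. The XPaths used here are assumptions about the Google Maps page layout (the h1 heading for the store name and the element whose data-item-id attribute is "address" for the address); in practice, confirm them by inspecting the store page, just as we did for the links.
# A sketch of the extraction step (the XPaths below are assumptions; verify by inspecting the page)
l2 <- list()   # store names
l3 <- list()   # store addresses

for(i in seq_along(l1)){

  # Go to the individual store page
  web_driver$navigate(l1[[i]])
  Sys.sleep(5)   # give the page time to load

  # Assumed location of the store name: the main h1 heading of the place page
  name_elem <- web_driver$findElement(using = "xpath", value = "//h1")
  l2[[i]] <- name_elem$getElementText()[[1]]

  # Assumed location of the address: the element whose data-item-id is 'address'
  addr_elem <- web_driver$findElement(using = "xpath",
                                      value = "//*[@data-item-id='address']")
  l3[[i]] <- addr_elem$getElementText()[[1]]

}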
Step 6: Creating the final data frame
final.df <- data.frame(Store_Name = as.character(l2),
                       Store_Address = as.character(l3))
final.df
## Store_Name
## 1 JUMBO KING MULUND
## 2 Jumbo Vada Pav
## Store_Address
## 1 Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
## 2 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
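Once the data frame is ready, it is good practice to close the browser and stop the Selenium server so that the background Firefox process does not keep running:
# Close the browser session and stop the Selenium server
web_driver$close()
driver$server$stop()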