Wednesday, December 13, 2023

Web Scraping using RSelenium - Tutorial 1

Web Scraping using RSelenium

Basics of Web Scraping

Web scraping is the process of extracting useful information from a website or URL. This information can be in the form of text, tables, embedded links, ratings, etc. It is a very handy tool when one wants to supplement existing information on country demographics, customer preferences, store locations, etc.


How to scrape data from a website

Data can be scraped in two ways:

  • One from a static website (one that doesn't change often). Examples include Wikipedia pages, government websites, and company e-sites. Here we can use the plain-HTML way of extracting information, using specific tags (a short rvest sketch follows this list).
  • Second from a dynamic website (such as Google pages, Shopee, etc.). Here the content is rendered through JavaScript and jQuery, so we can't use the HTML way (using specific tags) of extracting information.
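
For the static case, parsing the HTML directly usually works. Below is a minimal sketch using the rvest package; the Wikipedia page and the ".wikitable" selector are only illustrative assumptions.

library(rvest)

# Illustrative static page; any Wikipedia article works the same way
page <- read_html("https://en.wikipedia.org/wiki/Mumbai")

# Select the tables with class "wikitable" and parse them into data frames
tables <- page %>% html_elements(".wikitable") %>% html_table()

# Look at the first parsed table (which one you need depends on the page)
head(tables[[1]])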

In this blog, we will look at how to scrape data for the second option. We will use what is known as headless browsing. A headless browser lets you load a website without a GUI, with all the actions driven from a command-line interface. I won't go into further detail as it would muddy the purpose of the blog. In R, we use the RSelenium package, which helps with headless browsing.

Step 0: Importing the libraries

# Packages required for this tutorial
package.name<-c("tidyverse","RSelenium")

for(i in package.name){

  # Install the package only if it is not already available
  if(!require(i,character.only = T)){

    install.packages(i)
  }
  # Load the package
  library(i,character.only = T)

}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: RSelenium


Step 1: Extracting the names and addresses of all Jumbo Vada Pav stores in Mulund West, Mumbai

Let's say we want to extract the names and locations of all the stores in Mulund West. Let's see what we get when we search for this on Google.

We can see from the image that there are two Jumbo Vada Pav stores in Mulund West, Mumbai. I have taken this example because the concept is easy to grasp with just two stores.


Step 2: How would RSelenium do web scraping for these two stores

RSelenium would perform the following basic steps:

  • Start a headless browser
  • Navigate to the Google Maps page (shown above)
  • Get the URL (link) of each of these stores
  • Navigate to each of these links
  • Get the XPath for the store name and address
  • For each of these XPaths (name and address), get the element sitting at that location


So as the first step, we will start a headless browser. Firefox works fine on my system, so I will go with the Firefox browser.


Step 3: Start a headless Firefox browser

The syntax for initiating a headless Firefox browser is shown below

# Start a Selenium driver with Firefox (chromever = NULL skips the Chrome driver)
driver <- rsDriver(
  browser = c("firefox"),
  chromever = NULL,
  verbose = F,
  # Pass the headless flag to Firefox; on newer geckodriver versions the
  # capability key may need to be "moz:firefoxOptions" instead
  extraCapabilities = list("firefoxOptions" = list(args = list("--headless")))
)
# The client object is the remote driver used to interact with the browser
web_driver <- driver[["client"]]

Once I execute this, Firefox browser would pop up in the background as shown below.
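
To confirm that the session is alive, one quick (optional) check is to navigate to any page and read back its title:

web_driver$navigate("https://www.google.com")   # open any page
web_driver$getTitle()                            # should return the page title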

Step 4: Navigate to the Google Maps page for Jumbo Vada Pav

We will now use the Firefox browser to navigate to the Google Maps page for Jumbo Vada Pav.

nm<-"Jumbo wada pav mulund west "
ad_url<-str_c("https://www.google.co.id/maps/search/ ",nm)

# Now navigate to the URL.This is for the browser to go to that location
web_driver$navigate(ad_url)


Once I execute the above, the Firefox browser goes to the Jumbo Vada Pav page.
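
Step 5: Getting the store names and addresses

The search results list each store as a link. We collect those links, visit each one, and read the store name and address into the lists l2 and l3 used in the next step. The sketch below shows one way to do this; the CSS selector and XPaths are assumptions, since Google Maps changes its class names frequently, so they may need updating.

# Give the search results a few seconds to render
Sys.sleep(5)

# Collect the link (href) of every store in the results panel
# (the selector "a.hfpxzc" is an assumption and may need updating)
store_links <- web_driver$findElements(using = "css selector", value = "a.hfpxzc")
urls <- sapply(store_links, function(x) x$getElementAttribute("href")[[1]])

l2 <- list()  # store names
l3 <- list()  # store addresses

for(u in urls){
  web_driver$navigate(u)
  Sys.sleep(5)

  # XPaths for the name heading and the address button (assumptions as well)
  nm_el <- web_driver$findElement(using = "xpath", value = "//h1[contains(@class,'DUwDvf')]")
  ad_el <- web_driver$findElement(using = "xpath", value = "//button[@data-item-id='address']")

  l2 <- c(l2, nm_el$getElementText()[[1]])
  l3 <- c(l3, ad_el$getElementText()[[1]])
}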


Step 6: Creating the final data frame

# Combine the store names and addresses into the final data frame
final.df<-data.frame(Store_Name=as.character(l2),
                     Store_Address=as.character(l3))

final.df
##          Store_Name
## 1 JUMBO KING MULUND
## 2    Jumbo Vada Pav
##                                                                                 Store_Address
## 1      Shop no 1, Sardar Vallabhbhai Patel Rd, Mulund West, Mumbai, Maharashtra 400080, India
## 2 8, Sardar Vallabhbhai Patel Rd, Vidya Vihar, Mulund West, Mumbai, Maharashtra 400080, India
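
Finally, once the data frame is ready, it is good practice to close the browser and stop the Selenium server so that nothing keeps running in the background:

# Close the browser session and shut down the Selenium server
web_driver$close()
driver$server$stop()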
