# parsel

`parsel` is a framework for parallelized dynamic web-scraping using `RSelenium`. Leveraging parallel processing, it allows you to run any `RSelenium` web-scraping routine on multiple browser instances simultaneously, greatly increasing the efficiency of your scraping. `parsel` utilizes chunked input processing as well as error catching and logging to ensure seamless execution of your scraping routine and minimal data loss, even in the presence of unforeseen `RSelenium` errors.
## Installation

You can install the development version of `parsel` from GitHub with:
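A typical invocation, assuming `devtools` is installed and that the package lives at the `till-tietz/parsel` repository path (adjust if the repository has moved):

``` r
# install the development version from GitHub
# (the repository path is an assumption; adjust as needed)
devtools::install_github("till-tietz/parsel")
```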
## Example

The following example will hopefully serve to illustrate the functionality and ideas behind how `parsel` operates. We'll set up the following scraping job:

1. navigate to the (German) Wikipedia main page
2. open a random article and retrieve its title
3. navigate to the first page linked in the article's first paragraph
4. retrieve that page's title and first section

and parallelize it with `parsel`.
`parsel` requires two things:

1. a scraping function defining the actions to be executed in each `RSelenium` instance. Actions to be executed in each browser instance should be written in the conventional `RSelenium` syntax, with `remDr` specifying the remote driver.
2. some input `x` to those actions (e.g. search terms to be entered in search boxes, or links to navigate to, etc.)

``` r
library(RSelenium)
library(parsel)
#let's define our scraping function input
#we want to run our function 4 times, starting on the Wikipedia main page each time
input <- rep("https://de.wikipedia.org", 4)
#let's define our scraping function
get_wiki_text <- function(x){
  input_i <- x

  #navigate to input page (i.e. Wikipedia)
  remDr$navigate(input_i)

  #find and click random article
  rand_art <- remDr$findElement(using = "xpath", "/html/body/div[5]/div[2]/nav[1]/div/ul/li[3]/a")
  rand_art$clickElement()

  #get random article title
  title <- remDr$findElement(using = "id", "firstHeading")
  title <- title$getElementText()[[1]]

  #check if there is a linked page
  link_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]"))

  #if no linked page fill output with NA
  if(is(link_exists, "try-error")){
    first_link_title <- NA
    first_link_text <- NA

  #if there is a linked page
  } else {
    #click on link
    link <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]")
    link$clickElement()

    #get link page title
    title_exists <- try(remDr$findElement(using = "id", "firstHeading"))
    if(is(title_exists, "try-error")){
      first_link_title <- NA
    } else {
      first_link_title <- remDr$findElement(using = "id", "firstHeading")
      first_link_title <- first_link_title$getElementText()[[1]]
    }

    #get 1st section of link page
    text_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]"))
    if(is(text_exists, "try-error")){
      first_link_text <- NA
    } else {
      first_link_text <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]")
      first_link_text <- first_link_text$getElementText()[[1]]
    }
  }

  out <- data.frame("random_article" = title,
                    "first_link_title" = first_link_title,
                    "first_link_text" = first_link_text)
  return(out)
}
```
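Before parallelizing, it can help to sanity-check the function in a single browser session. A minimal sketch, assuming a Selenium server is already running on `localhost:4445` (the port and setup are assumptions; see the `RSelenium` documentation for starting a server):

``` r
#quick single-session test of get_wiki_text() before parallelizing
#(assumes a Selenium server is already running on localhost:4445)
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "firefox")
remDr$open()
get_wiki_text(input[[1]])
remDr$close()
```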
Now that we have our scraping function and input, we can parallelize the execution of the function. `parscrape` will show a progress bar, as well as elapsed and estimated remaining time, so you can keep track of scraping progress.
``` r
wiki_text <- parsel::parscrape(scrape_fun = get_wiki_text,
                               scrape_input = input,
                               cores = 2,
                               packages = c("RSelenium", "XML"),
                               browser = "firefox",
                               scrape_tries = 1)
```
`parscrape` returns a list with two elements:

1. a list of your scraping function's output
2. a list of the input elements it was unable to scrape
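Assuming that structure, a minimal sketch of inspecting the result and stacking the scraped data.frames into a single table (the element name used below is an assumption; inspect the returned object first):

``` r
#inspect the top level of the returned list
str(wiki_text, max.level = 1)

#stack the scraped data.frames into a single table
#("scraped_results" is an assumed element name; adjust after inspecting str() output)
results <- do.call(rbind, wiki_text[["scraped_results"]])
head(results)
```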