# parsel

`parsel` is a framework for parallelized dynamic web-scraping using `RSelenium`. Leveraging parallel processing, it allows you to run any `RSelenium` web-scraping routine on multiple browser instances simultaneously, greatly increasing the efficiency of your scraping. `parsel` utilizes chunked input processing as well as error catching and logging to ensure seamless execution of your scraping routine and minimal data loss, even in the presence of unforeseen `RSelenium` errors.
## Installation

You can install the development version of `parsel` from GitHub with:
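A typical invocation, assuming `devtools` is installed and that the package lives at the `till-tietz/parsel` repository path (adjust if the repository has moved):

``` r
# install the development version from GitHub
# (the repository path is an assumption; adjust as needed)
devtools::install_github("till-tietz/parsel")
```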
## Example

The following example will hopefully serve to illustrate the functionality and ideas behind how `parsel` operates. We'll set up the following scraping job:

1. navigate to the (German) Wikipedia main page
2. open a random article and retrieve its title
3. navigate to the first page linked in the article's first paragraph
4. retrieve that page's title and first section

and parallelize it with `parsel`.
`parsel` requires two things:

1. a scraping function defining the actions to be executed in each `RSelenium` instance. Actions to be executed in each browser instance should be written in the conventional `RSelenium` syntax, with `remDr` specifying the remote driver.
2. some input `x` to those actions (e.g. search terms to be entered in search boxes, or links to navigate to, etc.)

``` r
library(RSelenium)
library(parsel)
#let's define our scraping function input
#we want to run our function 4 times, starting on the Wikipedia main page each time
input <- rep("https://de.wikipedia.org", 4)
#let's define our scraping function
get_wiki_text <- function(x){
  input_i <- x

  #navigate to input page (i.e. Wikipedia)
  remDr$navigate(input_i)

  #find and click random article
  rand_art <- remDr$findElement(using = "xpath", "/html/body/div[5]/div[2]/nav[1]/div/ul/li[3]/a")
  rand_art$clickElement()

  #get random article title
  title <- remDr$findElement(using = "id", "firstHeading")
  title <- title$getElementText()[[1]]

  #check if there is a linked page
  link_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]"))

  #if no linked page fill output with NA
  if(is(link_exists, "try-error")){
    first_link_title <- NA
    first_link_text <- NA

  #if there is a linked page
  } else {
    #click on link
    link <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]")
    link$clickElement()

    #get link page title
    title_exists <- try(remDr$findElement(using = "id", "firstHeading"))
    if(is(title_exists, "try-error")){
      first_link_title <- NA
    } else {
      first_link_title <- remDr$findElement(using = "id", "firstHeading")
      first_link_title <- first_link_title$getElementText()[[1]]
    }

    #get 1st section of link page
    text_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]"))
    if(is(text_exists, "try-error")){
      first_link_text <- NA
    } else {
      first_link_text <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]")
      first_link_text <- first_link_text$getElementText()[[1]]
    }
  }

  out <- data.frame("random_article" = title,
                    "first_link_title" = first_link_title,
                    "first_link_text" = first_link_text)
  return(out)
}
```
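Before parallelizing, it can help to sanity-check the function in a single browser session. A minimal sketch, assuming a Selenium server is already running on `localhost:4445` (the port and setup are assumptions; see the `RSelenium` documentation for starting a server):

``` r
#quick single-session test of get_wiki_text() before parallelizing
#(assumes a Selenium server is already running on localhost:4445)
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "firefox")
remDr$open()
get_wiki_text(input[[1]])
remDr$close()
```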
Now that we have our scraping function and input, we can parallelize the execution of the function. `parscrape` will show a progress bar, as well as elapsed and estimated remaining time, so you can keep track of scraping progress.
``` r
wiki_text <- parsel::parscrape(scrape_fun = get_wiki_text,
                               scrape_input = input,
                               cores = 2,
                               packages = c("RSelenium", "XML"),
                               browser = "firefox",
                               scrape_tries = 1)
```
`parscrape` returns a list with two elements:

1. a list of your scraping function's output
2. a list of the input elements it was unable to scrape
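Assuming that structure, a minimal sketch of inspecting the result and stacking the scraped data.frames into a single table (the element name used below is an assumption; inspect the returned object first):

``` r
#inspect the top level of the returned list
str(wiki_text, max.level = 1)

#stack the scraped data.frames into a single table
#("scraped_results" is an assumed element name; adjust after inspecting str() output)
results <- do.call(rbind, wiki_text[["scraped_results"]])
head(results)
```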