gutenbergr: Search and download public domain texts from Project Gutenberg

The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It includes tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest.

Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc:

library(gutenbergr)
gutenberg_metadata

## # A tibble: 51,997 x 8
##    gutenberg_id title  author  gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>  <chr>              <int> <chr>    <chr>            <chr> 
##  1            0  <NA>  <NA>                  NA en       <NA>             Publi…
##  2            1 "The … Jeffer…             1638 en       United States L… Publi…
##  3            2 "The … United…                1 en       American Revolu… Publi…
##  4            3 "John… Kenned…             1666 en       <NA>             Publi…
##  5            4 "Linc… Lincol…                3 en       US Civil War     Publi…
##  6            5 "The … United…                1 en       American Revolu… Publi…
##  7            6 "Give… Henry,…                4 en       American Revolu… Publi…
##  8            7 "The … <NA>                  NA en       <NA>             Publi…
##  9            8 "Abra… Lincol…                3 en       US Civil War     Publi…
## 10            9 "Abra… Lincol…                3 en       US Civil War     Publi…
## # … with 51,987 more rows, and 1 more variable: has_text <lgl>

For example, you could find the Gutenberg ID of Wuthering Heights by doing:

library(dplyr)

gutenberg_metadata %>%
  filter(title == "Wuthering Heights")

## # A tibble: 1 x 8
##   gutenberg_id title  author  gutenberg_autho… language gutenberg_booksh… rights
##          <int> <chr>  <chr>              <int> <chr>    <chr>             <chr> 
## 1          768 Wuthe… Brontë…              405 en       Gothic Fiction/M… Publi…
## # … with 1 more variable: has_text <lgl>
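
An exact comparison with == only finds works whose title is precisely "Wuthering Heights". As a minimal sketch of a more forgiving lookup (assuming the stringr package is installed), you could match part of the title instead and keep only a few columns:

```r
library(gutenbergr)
library(dplyr)
library(stringr)

# Partial, case-insensitive match on the title. Rows with a missing
# title are dropped, because filter() discards rows where the
# condition evaluates to NA.
wuthering_matches <- gutenberg_metadata %>%
  filter(str_detect(title, regex("wuthering", ignore_case = TRUE))) %>%
  select(gutenberg_id, title, author, language, has_text)

wuthering_matches
```

A partial match may return several editions or translations, so it is worth inspecting the result before picking an ID.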

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:

gutenberg_works()

## # A tibble: 40,737 x 8
##    gutenberg_id title  author  gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>  <chr>              <int> <chr>    <chr>            <chr> 
##  1            0  <NA>  <NA>                  NA en       <NA>             Publi…
##  2            1 "The … Jeffer…             1638 en       United States L… Publi…
##  3            2 "The … United…                1 en       American Revolu… Publi…
##  4            3 "John… Kenned…             1666 en       <NA>             Publi…
##  5            4 "Linc… Lincol…                3 en       US Civil War     Publi…
##  6            5 "The … United…                1 en       American Revolu… Publi…
##  7            6 "Give… Henry,…                4 en       American Revolu… Publi…
##  8            7 "The … <NA>                  NA en       <NA>             Publi…
##  9            8 "Abra… Lincol…                3 en       US Civil War     Publi…
## 10            9 "Abra… Lincol…                3 en       US Civil War     Publi…
## # … with 40,727 more rows, and 1 more variable: has_text <lgl>

It also allows you to perform filtering as an argument:

gutenberg_works(author == "Austen, Jane")

## # A tibble: 10 x 8
##    gutenberg_id title   author gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>   <chr>             <int> <chr>    <chr>            <chr> 
##  1          105 "Persu… Auste…               68 en       <NA>             Publi…
##  2          121 "North… Auste…               68 en       Gothic Fiction   Publi…
##  3          141 "Mansf… Auste…               68 en       <NA>             Publi…
##  4          158 "Emma"  Auste…               68 en       <NA>             Publi…
##  5          161 "Sense… Auste…               68 en       <NA>             Publi…
##  6          946 "Lady … Auste…               68 en       <NA>             Publi…
##  7         1212 "Love … Auste…               68 en       <NA>             Publi…
##  8         1342 "Pride… Auste…               68 en       Best Books Ever… Publi…
##  9        31100 "The C… Auste…               68 en       <NA>             Publi…
## 10        42078 "The L… Auste…               68 en       <NA>             Publi…
## # … with 1 more variable: has_text <lgl>

# or with a regular expression

library(stringr)
gutenberg_works(str_detect(author, "Austen"))

## # A tibble: 13 x 8
##    gutenberg_id title   author gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>   <chr>             <int> <chr>    <chr>            <chr> 
##  1          105 "Persu… Auste…               68 en       <NA>             Publi…
##  2          121 "North… Auste…               68 en       Gothic Fiction   Publi…
##  3          141 "Mansf… Auste…               68 en       <NA>             Publi…
##  4          158 "Emma"  Auste…               68 en       <NA>             Publi…
##  5          161 "Sense… Auste…               68 en       <NA>             Publi…
##  6          946 "Lady … Auste…               68 en       <NA>             Publi…
##  7         1212 "Love … Auste…               68 en       <NA>             Publi…
##  8         1342 "Pride… Auste…               68 en       Best Books Ever… Publi…
##  9        17797 "Memoi… Auste…             7603 en       <NA>             Publi…
## 10        31100 "The C… Auste…               68 en       <NA>             Publi…
## 11        33513 "The F… Auste…            36446 en       <NA>             Publi…
## 12        39897 "Disco… Layar…            40288 en       <NA>             Publi…
## 13        42078 "The L… Auste…               68 en       <NA>             Publi…
## # … with 1 more variable: has_text <lgl>
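
The pre-filtering can also be adjusted rather than bypassed. The sketch below assumes that your installed version of gutenbergr exposes a languages argument on gutenberg_works() that takes a language code; check ?gutenberg_works if it behaves differently:

```r
library(gutenbergr)
library(dplyr)

# Default behaviour: English works that have downloadable text
english_works <- gutenberg_works()

# Assumption: `languages` accepts a language code such as "fr"
french_works <- gutenberg_works(languages = "fr")

# Confirm which language(s) ended up in the result
count(french_works, language)
```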

The meta-data currently in the package was last updated on 05 May 2016.

Downloading books by ID

The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that “Wuthering Heights” has ID 768 (the ID also appears in the book's Project Gutenberg URL), so gutenberg_download(768) downloads this text.

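# Offline variant: read the zipped copy that ships with the package
f768 <- system.file("extdata", "768.zip", package = "gutenbergr")
wuthering_heights <- gutenberg_download(768,
                                        files = f768,
                                        mirror = "http://aleph.gutenberg.org")

# Usual variant: download the text from Project Gutenberg
wuthering_heights <- gutenberg_download(768)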

wuthering_heights

## # A tibble: 12,085 x 2
##    gutenberg_id text                                                            
##           <int> <chr>                                                           
##  1          768 "WUTHERING HEIGHTS"                                             
##  2          768 ""                                                              
##  3          768 ""                                                              
##  4          768 "CHAPTER I"                                                     
##  5          768 ""                                                              
##  6          768 ""                                                              
##  7          768 "1801.--I have just returned from a visit to my landlord--the s…
##  8          768 "neighbour that I shall be troubled with.  This is certainly a …
##  9          768 "country!  In all England, I do not believe that I could have f…
## 10          768 "situation so completely removed from the stir of society.  A p…
## # … with 12,075 more rows

Notice it is returned as a tbl_df (a type of data frame) with two variables: gutenberg_id (useful if multiple books are returned) and text, a character vector with one row per line. The header and footer added by Project Gutenberg (visible in the raw file on the Project Gutenberg site) have been stripped away.

Provide a vector of IDs to download multiple books. For example, to download Jane Eyre (book 1260) along with Wuthering Heights, do:

# Offline variant: read the zipped copies that ship with the package
f1260 <- system.file("extdata", "1260.zip", package = "gutenbergr")
books <- gutenberg_download(c(768, 1260),
                            meta_fields = "title",
                            files = c(f768, f1260),
                            mirror = "http://aleph.gutenberg.org")

# Usual variant: download both texts from Project Gutenberg
books <- gutenberg_download(c(768, 1260), meta_fields = "title")

books

## # A tibble: 32,744 x 3
##    gutenberg_id text                                                title       
##           <int> <chr>                                               <chr>       
##  1          768 "WUTHERING HEIGHTS"                                 Wuthering H…
##  2          768 ""                                                  Wuthering H…
##  3          768 ""                                                  Wuthering H…
##  4          768 "CHAPTER I"                                         Wuthering H…
##  5          768 ""                                                  Wuthering H…
##  6          768 ""                                                  Wuthering H…
##  7          768 "1801.--I have just returned from a visit to my la… Wuthering H…
##  8          768 "neighbour that I shall be troubled with.  This is… Wuthering H…
##  9          768 "country!  In all England, I do not believe that I… Wuthering H…
## 10          768 "situation so completely removed from the stir of … Wuthering H…
## # … with 32,734 more rows

Notice that the meta_fields argument allows us to add one or more additional fields from the gutenberg_metadata to the downloaded text, such as title or author.

books %>%
  count(title)

## # A tibble: 2 x 2
##   title                           n
##   <chr>                       <int>
## 1 Jane Eyre: An Autobiography 20659
## 2 Wuthering Heights           12085
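
Conceptually, meta_fields gives a result similar to joining the downloaded text with gutenberg_metadata on gutenberg_id. The sketch below illustrates that idea with a small stand-in table, so it runs without a network connection. The rows of `downloaded` are hypothetical, and the join is an illustration rather than a description of the package's internals:

```r
library(gutenbergr)
library(dplyr)

# Hypothetical stand-in for downloaded text; a real call to
# gutenberg_download() returns one row per line of each book.
downloaded <- tibble(
  gutenberg_id = c(768L, 768L, 1260L),
  text = c("WUTHERING HEIGHTS", "CHAPTER I", "JANE EYRE")
)

# Attach title and author from the bundled metadata by ID
with_meta <- downloaded %>%
  left_join(select(gutenberg_metadata, gutenberg_id, title, author),
            by = "gutenberg_id")

with_meta
```

The same approach works after the fact, when you have already downloaded text without meta_fields and later decide you need another metadata column.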

Other meta-datasets

You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress subject headings:

gutenberg_subjects

## # A tibble: 140,173 x 3
##    gutenberg_id subject_type subject                                            
##           <int> <chr>        <chr>                                              
##  1            1 lcc          E201                                               
##  2            1 lcsh         United States. Declaration of Independence         
##  3            1 lcsh         United States -- History -- Revolution, 1775-1783 …
##  4            1 lcc          JK                                                 
##  5            2 lcc          KF                                                 
##  6            2 lcsh         Civil rights -- United States -- Sources           
##  7            2 lcsh         United States. Constitution. 1st-10th Amendments   
##  8            2 lcc          JK                                                 
##  9            3 lcsh         Presidents -- United States -- Inaugural addresses 
## 10            3 lcsh         United States -- Foreign relations -- 1961-1963    
## # … with 140,163 more rows

This is useful for extracting texts from a particular topic or genre, such as detective stories, or a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.

gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")

## # A tibble: 521 x 3
##    gutenberg_id subject_type subject                      
##           <int> <chr>        <chr>                        
##  1          170 lcsh         Detective and mystery stories
##  2          173 lcsh         Detective and mystery stories
##  3          244 lcsh         Detective and mystery stories
##  4          305 lcsh         Detective and mystery stories
##  5          330 lcsh         Detective and mystery stories
##  6          481 lcsh         Detective and mystery stories
##  7          547 lcsh         Detective and mystery stories
##  8          863 lcsh         Detective and mystery stories
##  9          905 lcsh         Detective and mystery stories
## 10         1155 lcsh         Detective and mystery stories
## # … with 511 more rows

gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))

## # A tibble: 47 x 3
##    gutenberg_id subject_type subject                                           
##           <int> <chr>        <chr>                                             
##  1          108 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  2          221 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  3          244 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  4          834 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  5         1661 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  6         2097 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  7         2343 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  8         2344 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  9         2345 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
## 10         2346 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
## # … with 37 more rows
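
One way to link these subject matches back to the main metadata is a semi-join on gutenberg_id. This is a minimal sketch; the object names holmes_ids and holmes_works are illustrative only:

```r
library(gutenbergr)
library(dplyr)

# IDs of works tagged with a Sherlock Holmes subject heading
holmes_ids <- gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject)) %>%
  distinct(gutenberg_id)

# Keep only the pre-filtered works whose ID appears in that set
holmes_works <- gutenberg_works() %>%
  semi_join(holmes_ids, by = "gutenberg_id") %>%
  select(gutenberg_id, title, author)

holmes_works
```

The resulting gutenberg_id column could then be passed to gutenberg_download(). The count may be smaller than the number of subject matches, because gutenberg_works() drops non-English works and works without text.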

gutenberg_authors contains information about each author, such as aliases and birth/death year:

gutenberg_authors

## # A tibble: 16,236 x 7
##    gutenberg_author… author    alias  birthdate deathdate wikipedia   aliases   
##                <int> <chr>     <chr>      <int>     <int> <chr>       <chr>     
##  1                 1 United S… <NA>          NA        NA <NA>        <NA>      
##  2                 3 Lincoln,… <NA>        1809      1865 http://en.… United St…
##  3                 4 Henry, P… <NA>        1736      1799 http://en.… <NA>      
##  4                 5 Adam, Pa… <NA>          NA        NA <NA>        <NA>      
##  5                 7 Carroll,… Dodgs…      1832      1898 http://en.… <NA>      
##  6                 8 United S… <NA>          NA        NA <NA>        Agency, U…
##  7                 9 Melville… Melvi…      1819      1891 http://en.… <NA>      
##  8                10 Barrie, … Barri…      1860      1937 http://en.… <NA>      
##  9                12 Smith, J… Smith…      1805      1844 http://en.… <NA>      
## 10                14 Madison,… Unite…      1751      1836 http://en.… <NA>      
## # … with 16,226 more rows
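
Because both tables carry gutenberg_author_id, author details can be joined onto the works. The sketch below selects works by authors born in the first half of the 19th century; the year range and object names are arbitrary choices for illustration:

```r
library(gutenbergr)
library(dplyr)

# Authors with a known birth year in [1800, 1850);
# filter() drops authors whose birthdate is NA.
early_1800s_authors <- gutenberg_authors %>%
  filter(birthdate >= 1800, birthdate < 1850) %>%
  select(gutenberg_author_id, birthdate, deathdate)

# Attach the dates to the pre-filtered works by author ID
works_by_era <- gutenberg_works() %>%
  inner_join(early_1800s_authors, by = "gutenberg_author_id") %>%
  select(gutenberg_id, title, author, birthdate, deathdate)

works_by_era
```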

Analysis

What’s next after retrieving a book’s text? Well, having the book as a data frame is especially useful for working with the tidytext package for text analysis.

library(tidytext)

words <- books %>%
  unnest_tokens(word, text)

words

## # A tibble: 305,532 x 3
##    gutenberg_id title             word     
##           <int> <chr>             <chr>    
##  1          768 Wuthering Heights wuthering
##  2          768 Wuthering Heights heights  
##  3          768 Wuthering Heights chapter  
##  4          768 Wuthering Heights i        
##  5          768 Wuthering Heights 1801     
##  6          768 Wuthering Heights i        
##  7          768 Wuthering Heights have     
##  8          768 Wuthering Heights just     
##  9          768 Wuthering Heights returned 
## 10          768 Wuthering Heights from     
## # … with 305,522 more rows

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

word_counts

## # A tibble: 21,200 x 3
##    title                       word           n
##    <chr>                       <chr>      <int>
##  1 Wuthering Heights           heathcliff   421
##  2 Wuthering Heights           linton       346
##  3 Jane Eyre: An Autobiography jane         342
##  4 Wuthering Heights           catherine    336
##  5 Jane Eyre: An Autobiography rochester    317
##  6 Jane Eyre: An Autobiography sir          315
##  7 Jane Eyre: An Autobiography miss         310
##  8 Jane Eyre: An Autobiography time         244
##  9 Jane Eyre: An Autobiography day          232
## 10 Jane Eyre: An Autobiography looked       221
## # … with 21,190 more rows
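
Raw counts favour words that are common in every book. A possible next step is tf-idf, which tidytext provides through bind_tf_idf(). The sketch below uses a tiny made-up table so that it runs without downloading anything. With the real data you would start from a table of per-title word counts instead:

```r
library(dplyr)
library(tidytext)

# Toy stand-in for `books`; the lines are invented, not real text
toy_books <- tibble(
  title = c("Book A", "Book A", "Book B"),
  text = c("the moor was dark", "the wind on the moor", "the hall was quiet")
)

toy_tf_idf <- toy_books %>%
  unnest_tokens(word, text) %>%      # one row per word
  count(title, word) %>%             # word counts per title
  bind_tf_idf(word, title, n) %>%    # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))

toy_tf_idf
```

Words that occur in every title, such as "the" here, get a tf-idf of zero, so the top of the table shows the words that distinguish one book from the other.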

You may also find these resources useful:

The Natural Language Processing CRAN View suggests many R packages related to text mining, especially around the tm package
You could match the wikipedia column in gutenberg_authors to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
If you’re considering an analysis based on author name, you may find the humaniformat (for extraction of first names) and gender (prediction of gender from first names) packages useful. (Note that humaniformat has a format_reverse function for reversing “Last, First” names).
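
To make the “Last, First” reversal concrete, here is a rough base R sketch. It is not the humaniformat implementation, only a simple regular expression that handles the plain "Last, First" pattern; names with extra commas or parenthetical expansions would need more care:

```r
# Example author strings in the "Last, First" style used by the metadata
authors <- c("Austen, Jane", "Doyle, Arthur Conan", "United States")

# Swap the parts around the first comma; strings without a comma
# are returned unchanged because the pattern does not match.
reversed <- sub("^([^,]+),\\s*(.+)$", "\\2 \\1", authors)

reversed
```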

David Robinson

2021-05-28
