gutenbergr: Search and download public domain texts from Project Gutenberg

David Robinson

2021-05-28

The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It includes both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest.

Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc:

library(gutenbergr)
gutenberg_metadata
## # A tibble: 51,997 x 8
##    gutenberg_id title  author  gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>  <chr>              <int> <chr>    <chr>            <chr> 
##  1            0  <NA>  <NA>                  NA en       <NA>             Publi…
##  2            1 "The … Jeffer…             1638 en       United States L… Publi…
##  3            2 "The … United…                1 en       American Revolu… Publi…
##  4            3 "John… Kenned…             1666 en       <NA>             Publi…
##  5            4 "Linc… Lincol…                3 en       US Civil War     Publi…
##  6            5 "The … United…                1 en       American Revolu… Publi…
##  7            6 "Give… Henry,…                4 en       American Revolu… Publi…
##  8            7 "The … <NA>                  NA en       <NA>             Publi…
##  9            8 "Abra… Lincol…                3 en       US Civil War     Publi…
## 10            9 "Abra… Lincol…                3 en       US Civil War     Publi…
## # … with 51,987 more rows, and 1 more variable: has_text <lgl>

For example, you could find the Gutenberg ID of Wuthering Heights by doing:

library(dplyr)

gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
## # A tibble: 1 x 8
##   gutenberg_id title  author  gutenberg_autho… language gutenberg_booksh… rights
##          <int> <chr>  <chr>              <int> <chr>    <chr>             <chr> 
## 1          768 Wuthe… Brontë…              405 en       Gothic Fiction/M… Publi…
## # … with 1 more variable: has_text <lgl>
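
Exact matching with == only works when you know the title exactly as it is recorded. The following is a minimal sketch of a case-insensitive partial match. It uses a small stand-in tibble (the hypothetical name toy_metadata, with IDs and titles taken from output shown in this vignette) rather than the full gutenberg_metadata, so it runs without the package data:

```r
library(dplyr)
library(stringr)

# Stand-in for gutenberg_metadata, reduced to the two columns used here.
# toy_metadata is an illustrative name, not an object from the package.
toy_metadata <- tibble(
  gutenberg_id = c(158L, 768L, 1260L),
  title = c("Emma", "Wuthering Heights", "Jane Eyre: An Autobiography")
)

# Match part of the title, ignoring case, then extract the matching ID(s).
wuthering_ids <- toy_metadata %>%
  filter(str_detect(title, regex("wuthering", ignore_case = TRUE))) %>%
  pull(gutenberg_id)

wuthering_ids
```

The same filter() call should work on gutenberg_metadata itself, although a partial match there may return more than one row.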

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:

gutenberg_works()
## # A tibble: 40,737 x 8
##    gutenberg_id title  author  gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>  <chr>              <int> <chr>    <chr>            <chr> 
##  1            0  <NA>  <NA>                  NA en       <NA>             Publi…
##  2            1 "The … Jeffer…             1638 en       United States L… Publi…
##  3            2 "The … United…                1 en       American Revolu… Publi…
##  4            3 "John… Kenned…             1666 en       <NA>             Publi…
##  5            4 "Linc… Lincol…                3 en       US Civil War     Publi…
##  6            5 "The … United…                1 en       American Revolu… Publi…
##  7            6 "Give… Henry,…                4 en       American Revolu… Publi…
##  8            7 "The … <NA>                  NA en       <NA>             Publi…
##  9            8 "Abra… Lincol…                3 en       US Civil War     Publi…
## 10            9 "Abra… Lincol…                3 en       US Civil War     Publi…
## # … with 40,727 more rows, and 1 more variable: has_text <lgl>

It also accepts filter conditions as arguments:

gutenberg_works(author == "Austen, Jane")
## # A tibble: 10 x 8
##    gutenberg_id title   author gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>   <chr>             <int> <chr>    <chr>            <chr> 
##  1          105 "Persu… Auste…               68 en       <NA>             Publi…
##  2          121 "North… Auste…               68 en       Gothic Fiction   Publi…
##  3          141 "Mansf… Auste…               68 en       <NA>             Publi…
##  4          158 "Emma"  Auste…               68 en       <NA>             Publi…
##  5          161 "Sense… Auste…               68 en       <NA>             Publi…
##  6          946 "Lady … Auste…               68 en       <NA>             Publi…
##  7         1212 "Love … Auste…               68 en       <NA>             Publi…
##  8         1342 "Pride… Auste…               68 en       Best Books Ever… Publi…
##  9        31100 "The C… Auste…               68 en       <NA>             Publi…
## 10        42078 "The L… Auste…               68 en       <NA>             Publi…
## # … with 1 more variable: has_text <lgl>

# or with a regular expression
library(stringr)
gutenberg_works(str_detect(author, "Austen"))
## # A tibble: 13 x 8
##    gutenberg_id title   author gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>   <chr>             <int> <chr>    <chr>            <chr> 
##  1          105 "Persu… Auste…               68 en       <NA>             Publi…
##  2          121 "North… Auste…               68 en       Gothic Fiction   Publi…
##  3          141 "Mansf… Auste…               68 en       <NA>             Publi…
##  4          158 "Emma"  Auste…               68 en       <NA>             Publi…
##  5          161 "Sense… Auste…               68 en       <NA>             Publi…
##  6          946 "Lady … Auste…               68 en       <NA>             Publi…
##  7         1212 "Love … Auste…               68 en       <NA>             Publi…
##  8         1342 "Pride… Auste…               68 en       Best Books Ever… Publi…
##  9        17797 "Memoi… Auste…             7603 en       <NA>             Publi…
## 10        31100 "The C… Auste…               68 en       <NA>             Publi…
## 11        33513 "The F… Auste…            36446 en       <NA>             Publi…
## 12        39897 "Disco… Layar…            40288 en       <NA>             Publi…
## 13        42078 "The L… Auste…               68 en       <NA>             Publi…
## # … with 1 more variable: has_text <lgl>
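
The regular-expression version returns 13 rows rather than 10 because str_detect() matches "Austen" anywhere in the author field, so it also picks up works whose author merely contains that string (note the different gutenberg_author_id values above). If that is not what you want, anchoring the pattern is one option. The sketch below uses illustrative author strings, which are assumptions and not copied from the package data:

```r
library(dplyr)
library(stringr)

# Illustrative author strings (assumed for this sketch).
toy_authors <- tibble(
  author = c("Austen, Jane", "Austen-Leigh, James Edward", "Layard, Austen Henry")
)

# Unanchored: matches "Austen" anywhere in the string.
loose <- toy_authors %>%
  filter(str_detect(author, "Austen"))

# Anchored: matches only the exact author string.
strict <- toy_authors %>%
  filter(str_detect(author, "^Austen, Jane$"))

nrow(loose)
nrow(strict)
```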

The metadata currently in the package was last updated on 05 May 2016.

Downloading books by ID

The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that “Wuthering Heights” has ID 768 (the ID also appears in the book's Project Gutenberg URL), so gutenberg_download(768) downloads this text.

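# Offline variant: read a zipped copy bundled with the package
# (the files argument reads from local .zip files instead of the site)
f768 <- system.file("extdata", "768.zip", package = "gutenbergr")
wuthering_heights <- gutenberg_download(768,
                                        files = f768,
                                        mirror = "http://aleph.gutenberg.org")

# Usual call: fetch the text from a Project Gutenberg mirror
wuthering_heights <- gutenberg_download(768)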
wuthering_heights
## # A tibble: 12,085 x 2
##    gutenberg_id text                                                            
##           <int> <chr>                                                           
##  1          768 "WUTHERING HEIGHTS"                                             
##  2          768 ""                                                              
##  3          768 ""                                                              
##  4          768 "CHAPTER I"                                                     
##  5          768 ""                                                              
##  6          768 ""                                                              
##  7          768 "1801.--I have just returned from a visit to my landlord--the s…
##  8          768 "neighbour that I shall be troubled with.  This is certainly a …
##  9          768 "country!  In all England, I do not believe that I could have f…
## 10          768 "situation so completely removed from the stir of society.  A p…
## # … with 12,075 more rows

The result is a tbl_df (a type of data frame) with two columns: gutenberg_id (useful if multiple books are returned) and text, a character vector with one row per line of the book. The header and footer added by Project Gutenberg have been stripped away.
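
Because the text arrives one row per line, structure such as line numbers and chapter boundaries can be recovered with ordinary data frame operations. The following is a sketch under the assumption that chapter headings look like "CHAPTER I"; it uses a few stand-in lines (the hypothetical sample_lines, abbreviated from the output above) so that it runs without network access:

```r
library(dplyr)
library(stringr)

# Stand-in for the first rows returned by gutenberg_download(768).
# sample_lines is an illustrative name; the last line is abbreviated.
sample_lines <- tibble(
  gutenberg_id = 768L,
  text = c("WUTHERING HEIGHTS", "", "", "CHAPTER I", "", "",
           "1801.--I have just returned from a visit to my landlord")
)

# Number each line and count chapter headings seen so far.
# Lines before the first heading get chapter 0.
annotated <- sample_lines %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, "^CHAPTER [IVXLC]+"))
  )

annotated
```

Other books may format their headings differently, so the pattern would need adjusting for each text.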

Provide a vector of IDs to download multiple books. For example, to download Jane Eyre (book 1260) along with Wuthering Heights, do:

# Offline variant: read zipped copies bundled with the package
f1260 <- system.file("extdata", "1260.zip", package = "gutenbergr")
books <- gutenberg_download(c(768, 1260),
                            meta_fields = "title",
                            files = c(f768, f1260),
                            mirror = "http://aleph.gutenberg.org")

# Usual call: fetch both texts from a Project Gutenberg mirror
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
## # A tibble: 32,744 x 3
##    gutenberg_id text                                                title       
##           <int> <chr>                                               <chr>       
##  1          768 "WUTHERING HEIGHTS"                                 Wuthering H…
##  2          768 ""                                                  Wuthering H…
##  3          768 ""                                                  Wuthering H…
##  4          768 "CHAPTER I"                                         Wuthering H…
##  5          768 ""                                                  Wuthering H…
##  6          768 ""                                                  Wuthering H…
##  7          768 "1801.--I have just returned from a visit to my la… Wuthering H…
##  8          768 "neighbour that I shall be troubled with.  This is… Wuthering H…
##  9          768 "country!  In all England, I do not believe that I… Wuthering H…
## 10          768 "situation so completely removed from the stir of … Wuthering H…
## # … with 32,734 more rows

The meta_fields argument adds one or more fields from gutenberg_metadata, such as title or author, as extra columns alongside the downloaded text.

books %>%
  count(title)
## # A tibble: 2 x 2
##   title                           n
##   <chr>                       <int>
## 1 Jane Eyre: An Autobiography 20659
## 2 Wuthering Heights           12085
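
Conceptually, adding a metadata field resembles joining the text to the metadata on gutenberg_id. The sketch below illustrates that idea with stand-in tibbles (toy_text and toy_metadata are hypothetical names, and the text lines are placeholders). It is not a description of how the package implements meta_fields:

```r
library(dplyr)

# Placeholder text rows for two books.
toy_text <- tibble(
  gutenberg_id = c(768L, 768L, 1260L),
  text = c("first line of book 768", "second line of book 768",
           "first line of book 1260")
)

# Stand-in metadata with the IDs and titles shown in this vignette.
toy_metadata <- tibble(
  gutenberg_id = c(768L, 1260L),
  title = c("Wuthering Heights", "Jane Eyre: An Autobiography"),
  language = "en"
)

# Attach only the requested field(s) to every line of text.
with_title <- toy_text %>%
  left_join(select(toy_metadata, gutenberg_id, title), by = "gutenberg_id")

with_title
```

A join like this is also a way to attach metadata after the fact if you downloaded the text without meta_fields.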

Other meta-datasets

You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress subject headings:

gutenberg_subjects
## # A tibble: 140,173 x 3
##    gutenberg_id subject_type subject                                            
##           <int> <chr>        <chr>                                              
##  1            1 lcc          E201                                               
##  2            1 lcsh         United States. Declaration of Independence         
##  3            1 lcsh         United States -- History -- Revolution, 1775-1783 …
##  4            1 lcc          JK                                                 
##  5            2 lcc          KF                                                 
##  6            2 lcsh         Civil rights -- United States -- Sources           
##  7            2 lcsh         United States. Constitution. 1st-10th Amendments   
##  8            2 lcc          JK                                                 
##  9            3 lcsh         Presidents -- United States -- Inaugural addresses 
## 10            3 lcsh         United States -- Foreign relations -- 1961-1963    
## # … with 140,163 more rows

This is useful for finding texts on a particular topic or in a particular genre, such as detective stories, or texts featuring a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link them with other metadata.

gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")
## # A tibble: 521 x 3
##    gutenberg_id subject_type subject                      
##           <int> <chr>        <chr>                        
##  1          170 lcsh         Detective and mystery stories
##  2          173 lcsh         Detective and mystery stories
##  3          244 lcsh         Detective and mystery stories
##  4          305 lcsh         Detective and mystery stories
##  5          330 lcsh         Detective and mystery stories
##  6          481 lcsh         Detective and mystery stories
##  7          547 lcsh         Detective and mystery stories
##  8          863 lcsh         Detective and mystery stories
##  9          905 lcsh         Detective and mystery stories
## 10         1155 lcsh         Detective and mystery stories
## # … with 511 more rows

gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
## # A tibble: 47 x 3
##    gutenberg_id subject_type subject                                           
##           <int> <chr>        <chr>                                             
##  1          108 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  2          221 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  3          244 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  4          834 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  5         1661 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  6         2097 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  7         2343 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  8         2344 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
##  9         2345 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
## 10         2346 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
## # … with 37 more rows
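
To turn a subject filter into something you can download, extract the matching IDs as a vector. The sketch below does this on a small stand-in table (toy_subjects is a hypothetical name; its rows are copied from the output above) and also finds works that carry both subjects:

```r
library(dplyr)
library(stringr)

# Stand-in for gutenberg_subjects, using rows shown in the output above.
toy_subjects <- tibble(
  gutenberg_id = c(170L, 244L, 108L, 244L),
  subject_type = "lcsh",
  subject = c("Detective and mystery stories",
              "Detective and mystery stories",
              "Holmes, Sherlock (Fictitious character) -- Fiction",
              "Holmes, Sherlock (Fictitious character) -- Fiction")
)

# IDs of works with a Sherlock Holmes subject heading.
holmes_ids <- toy_subjects %>%
  filter(str_detect(subject, fixed("Holmes, Sherlock"))) %>%
  distinct(gutenberg_id) %>%
  pull(gutenberg_id)

# Works tagged as detective stories that also have the Holmes heading.
both_ids <- toy_subjects %>%
  filter(subject == "Detective and mystery stories",
         gutenberg_id %in% holmes_ids) %>%
  pull(gutenberg_id)

holmes_ids
both_ids
```

With the real gutenberg_subjects table, a vector such as holmes_ids could then be passed to gutenberg_download().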

gutenberg_authors contains information about each author, such as aliases and birth/death year:

gutenberg_authors
## # A tibble: 16,236 x 7
##    gutenberg_author… author    alias  birthdate deathdate wikipedia   aliases   
##                <int> <chr>     <chr>      <int>     <int> <chr>       <chr>     
##  1                 1 United S… <NA>          NA        NA <NA>        <NA>      
##  2                 3 Lincoln,… <NA>        1809      1865 http://en.… United St…
##  3                 4 Henry, P… <NA>        1736      1799 http://en.… <NA>      
##  4                 5 Adam, Pa… <NA>          NA        NA <NA>        <NA>      
##  5                 7 Carroll,… Dodgs…      1832      1898 http://en.… <NA>      
##  6                 8 United S… <NA>          NA        NA <NA>        Agency, U…
##  7                 9 Melville… Melvi…      1819      1891 http://en.… <NA>      
##  8                10 Barrie, … Barri…      1860      1937 http://en.… <NA>      
##  9                12 Smith, J… Smith…      1805      1844 http://en.… <NA>      
## 10                14 Madison,… Unite…      1751      1836 http://en.… <NA>      
## # … with 16,226 more rows
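
Author information links to works through gutenberg_author_id. As a sketch, the example below joins two stand-in tables (toy_works and toy_authors are hypothetical names; the IDs and dates are copied from the outputs above) to find works by authors born before 1800:

```r
library(dplyr)

# Stand-in for gutenberg_metadata: work IDs paired with author IDs.
toy_works <- tibble(
  gutenberg_id = c(4L, 6L, 8L, 9L),
  gutenberg_author_id = c(3L, 4L, 3L, 3L)
)

# Stand-in for gutenberg_authors: birth and death years per author ID.
toy_authors <- tibble(
  gutenberg_author_id = c(3L, 4L),
  birthdate = c(1809L, 1736L),
  deathdate = c(1865L, 1799L)
)

# Join on the shared key, then filter on an author attribute.
early_works <- toy_works %>%
  inner_join(toy_authors, by = "gutenberg_author_id") %>%
  filter(birthdate < 1800)

early_works
```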

Analysis

What’s next after retrieving a book’s text? Having the book as a data frame with one row per line makes it straightforward to use the tidytext package for text analysis.

library(tidytext)

words <- books %>%
  unnest_tokens(word, text)

words
## # A tibble: 305,532 x 3
##    gutenberg_id title             word     
##           <int> <chr>             <chr>    
##  1          768 Wuthering Heights wuthering
##  2          768 Wuthering Heights heights  
##  3          768 Wuthering Heights chapter  
##  4          768 Wuthering Heights i        
##  5          768 Wuthering Heights 1801     
##  6          768 Wuthering Heights i        
##  7          768 Wuthering Heights have     
##  8          768 Wuthering Heights just     
##  9          768 Wuthering Heights returned 
## 10          768 Wuthering Heights from     
## # … with 305,522 more rows

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

word_counts
## # A tibble: 21,200 x 3
##    title                       word           n
##    <chr>                       <chr>      <int>
##  1 Wuthering Heights           heathcliff   421
##  2 Wuthering Heights           linton       346
##  3 Jane Eyre: An Autobiography jane         342
##  4 Wuthering Heights           catherine    336
##  5 Jane Eyre: An Autobiography rochester    317
##  6 Jane Eyre: An Autobiography sir          315
##  7 Jane Eyre: An Autobiography miss         310
##  8 Jane Eyre: An Autobiography time         244
##  9 Jane Eyre: An Autobiography day          232
## 10 Jane Eyre: An Autobiography looked       221
## # … with 21,190 more rows
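
Because the counts are sorted across both books together, one title can dominate the top rows. A minimal sketch of taking the top words within each title follows; it rebuilds the ten rows printed above as a stand-in (toy_counts is a hypothetical name) and assumes dplyr 1.0.0 or later for slice_max():

```r
library(dplyr)

# The ten rows printed above, rebuilt as a stand-in for word_counts.
toy_counts <- tibble(
  title = c("Wuthering Heights", "Wuthering Heights",
            "Jane Eyre: An Autobiography", "Wuthering Heights",
            rep("Jane Eyre: An Autobiography", 6)),
  word = c("heathcliff", "linton", "jane", "catherine", "rochester",
           "sir", "miss", "time", "day", "looked"),
  n = c(421L, 346L, 342L, 336L, 317L, 315L, 310L, 244L, 232L, 221L)
)

# Keep the two most frequent words within each title.
top_words <- toy_counts %>%
  group_by(title) %>%
  slice_max(n, n = 2) %>%
  ungroup()

top_words
```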

You may also find these resources useful: