Converting textual data into numerical data can be done in
many different ways. The steps included in textrecipes
should give you the flexibility to perform most of your
desired text preprocessing tasks. This vignette showcases examples
that combine multiple steps.
This vignette will not do any modeling with the processed text, as its
purpose is to showcase the flexibility and modularity of the steps. Therefore the
only packages needed are dplyr, recipes, and textrecipes.
Examples will be performed on the okc_text data set, which is packaged with
modeldata.
library(dplyr)
library(recipes)
library(textrecipes)
library(modeldata)
data("okc_text")
Sometimes it is enough to know the counts of a handful of specific
words. This can easily be achieved by using the arguments
custom_stopword_source and keep = TRUE in
step_stopwords.
words <- c("you", "i", "sad", "happy")

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_stopwords(essay0, custom_stopword_source = words, keep = TRUE) %>%
  step_tf(essay0)

okc_obj <- okc_rec %>%
  prep()
bake(okc_obj, okc_text) %>%
select(starts_with("tf_essay0"))
#> # A tibble: 750 × 4
#> tf_essay0_happy tf_essay0_i tf_essay0_sad tf_essay0_you
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 0 3
#> 2 0 1 0 0
#> 3 0 21 0 1
#> 4 1 5 0 0
#> 5 0 3 0 3
#> 6 0 8 0 0
#> 7 0 15 0 5
#> 8 0 7 0 0
#> 9 0 0 0 0
#> 10 0 14 0 1
#> # … with 740 more rows
You might know of certain words you don't want included that aren't
part of your stop word list of choice. This can easily be done by
applying the step_stopwords step twice: once for the stop
words and once for your special words.
stopwords_list <- c("was", "she's", "who", "had", "some", "same", "you", "most",
                    "it's", "they", "for", "i'll", "which", "shan't", "we're",
                    "such", "more", "with", "there's", "each")

words <- c("sad", "happy")

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_stopwords(essay0, custom_stopword_source = stopwords_list) %>%
  step_stopwords(essay0, custom_stopword_source = words) %>%
  step_tfidf(essay0)

okc_obj <- okc_rec %>%
  prep()
bake(okc_obj, okc_text) %>%
select(starts_with("tfidf_essay0"))
#> # A tibble: 750 × 9,235
#> tfidf_essay0_0 tfidf_essay0_01 tfidf_essay0_0aare tfidf_essay0_0abilly
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 0 0 0
#> 5 0 0 0 0
#> 6 0 0 0 0
#> 7 0 0 0 0
#> 8 0 0 0 0
#> 9 0 0 0 0
#> 10 0 0 0 0
#> # … with 740 more rows, and 9,231 more variables:
#> # tfidf_essay0_0aboondocks <dbl>, tfidf_essay0_0abrothers <dbl>,
#> # tfidf_essay0_0aconfidential <dbl>, tfidf_essay0_0aconversation <dbl>,
#> # tfidf_essay0_0adebates <dbl>, tfidf_essay0_0afly <dbl>,
#> # tfidf_essay0_0afriends <dbl>, tfidf_essay0_0agiants <dbl>,
#> # tfidf_essay0_0ahop <dbl>, tfidf_essay0_0ahunters <dbl>,
#> # tfidf_essay0_0aking <dbl>, tfidf_essay0_0amovies <dbl>, …
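As a sketch of an alternative, the two custom word vectors could also be concatenated and filtered out in a single step_stopwords call (the hypothetical name all_stopwords is ours):

```r
# Hypothetical variant: combine both word lists into one custom
# stop word vector and remove them in a single step.
all_stopwords <- c(stopwords_list, words)

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_stopwords(essay0, custom_stopword_source = all_stopwords) %>%
  step_tfidf(essay0)
```

Keeping the two steps separate, as above, makes it easier to toggle or swap each word list independently.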
Another thing one might want to look at is the use of different
letters in a certain text. For this we can use the built-in character
tokenizer and keep only the letters using the
step_stopwords step.
okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, token = "characters") %>%
  step_stopwords(essay0, custom_stopword_source = letters, keep = TRUE) %>%
  step_tf(essay0)

okc_obj <- okc_rec %>%
  prep()
bake(okc_obj, okc_text) %>%
select(starts_with("tf_essay0"))
#> # A tibble: 750 × 26
#> tf_essay0_a tf_essay0_b tf_essay0_c tf_essay0_d tf_essay0_e tf_essay0_f
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 80 32 22 25 79 13
#> 2 8 3 5 5 8 0
#> 3 127 30 36 59 148 36
#> 4 28 3 5 10 34 6
#> 5 19 9 6 6 34 9
#> 6 97 21 22 34 130 26
#> 7 110 25 32 46 146 23
#> 8 66 13 9 23 69 12
#> 9 1 0 0 0 1 0
#> 10 250 76 115 106 274 53
#> # … with 740 more rows, and 20 more variables: tf_essay0_g <dbl>,
#> # tf_essay0_h <dbl>, tf_essay0_i <dbl>, tf_essay0_j <dbl>, tf_essay0_k <dbl>,
#> # tf_essay0_l <dbl>, tf_essay0_m <dbl>, tf_essay0_n <dbl>, tf_essay0_o <dbl>,
#> # tf_essay0_p <dbl>, tf_essay0_q <dbl>, tf_essay0_r <dbl>, tf_essay0_s <dbl>,
#> # tf_essay0_t <dbl>, tf_essay0_u <dbl>, tf_essay0_v <dbl>, tf_essay0_w <dbl>,
#> # tf_essay0_x <dbl>, tf_essay0_y <dbl>, tf_essay0_z <dbl>
Sometimes fairly complicated computations are needed. Here we would like the
term frequency-inverse document frequency (TF-IDF) of the 500 most common
ngrams, computed on stemmed tokens. This is quite a handful and would
seldom be included as an option in most other libraries, but the
modularity of textrecipes makes the task fairly easy.
First we tokenize into words and stem those words.
We then paste the stemmed tokens back together using
step_untokenize so that we are back to strings, which we
tokenize again, this time using the ngram tokenizer. Lastly we
filter and apply TF-IDF as usual.
okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, token = "words") %>%
  step_stem(essay0) %>%
  step_untokenize(essay0) %>%
  step_tokenize(essay0, token = "ngrams") %>%
  step_tokenfilter(essay0, max_tokens = 500) %>%
  step_tfidf(essay0)

okc_obj <- okc_rec %>%
  prep()
bake(okc_obj, okc_text) %>%
select(starts_with("tfidf_essay0"))
#> # A tibble: 750 × 500
#> `tfidf_essay0_a a class` `tfidf_essay0_a …` `tfidf_essay0_…` `tfidf_essay0_…`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 0 0 0
#> 5 0 0 0 0
#> 6 0 0 0 0
#> 7 0 0 0 0
#> 8 0 0 0 0
#> 9 0 0 0 0
#> 10 0 0 0 0
#> # … with 740 more rows, and 496 more variables: `tfidf_essay0_a br br` <dbl>,
#> # `tfidf_essay0_a class ilink` <dbl>, `tfidf_essay0_a coupl of` <dbl>,
#> # `tfidf_essay0_a few year` <dbl>, `tfidf_essay0_a good time` <dbl>,
#> # `tfidf_essay0_a i can` <dbl>, `tfidf_essay0_a laid back` <dbl>,
#> # `tfidf_essay0_a littl bit` <dbl>, `tfidf_essay0_a long a` <dbl>,
#> # `tfidf_essay0_a lot and` <dbl>, `tfidf_essay0_a lot of` <dbl>,
#> # `tfidf_essay0_a lover who` <dbl>, `tfidf_essay0_a man who` <dbl>, …
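As a final sketch, assuming a version of textrecipes that provides step_ngram(), the untokenize/retokenize round trip can be avoided by building ngrams directly from the stemmed token list:

```r
# Hedged variant: step_ngram() is available in later versions of
# textrecipes and constructs ngrams from an existing token list.
# num_tokens = 3 matches the trigram default of the "ngrams" tokenizer.
okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, token = "words") %>%
  step_stem(essay0) %>%
  step_ngram(essay0, num_tokens = 3) %>%
  step_tokenfilter(essay0, max_tokens = 500) %>%
  step_tfidf(essay0)
```

The result should be equivalent, with one fewer pass over the text, at the cost of requiring a newer package version.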