Accessing the Wordbank database

Mika Braginsky

2020-11-13

The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.

There are three data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item. Additionally, you can get metadata about the sources and instruments underlying the data. Advanced functionality lets you estimate words’ age of acquisition, map words across languages, and fit vocabulary size quantiles.

Administrations

The get_administration_data() function gives by-administration information, either for a specific language and/or form or for all instruments.

library(wordbankr)

get_administration_data(language = "English (American)", form = "WS")
## # A tibble: 5,520 x 15
##    data_id   age comprehension production language form  birth_order ethnicity
##      <dbl> <int>         <int>      <int> <chr>    <chr> <fct>       <fct>    
##  1  129242    27           497        497 English… WS    Fourth      Hispanic 
##  2  129243    21           369        369 English… WS    Second      White    
##  3  129244    26           190        190 English… WS    Fourth      White    
##  4  129245    27           264        264 English… WS    Second      White    
##  5  129246    19           159        159 English… WS    Second      Other    
##  6  129247    30           513        513 English… WS    Second      Other    
##  7  129248    25           444        444 English… WS    Second      Other    
##  8  129249    24           582        582 English… WS    Second      White    
##  9  129250    28           558        558 English… WS    Second      Black    
## 10  129251    18             7          7 English… WS    Fourth      Other    
## # … with 5,510 more rows, and 7 more variables: sex <fct>, zygosity <chr>,
## #   norming <lgl>, mom_ed <fct>, longitudinal <lgl>, source_name <chr>,
## #   license <chr>
get_administration_data()
## # A tibble: 82,055 x 15
##    data_id   age comprehension production language form  birth_order ethnicity
##      <dbl> <int>         <int>      <int> <chr>    <chr> <fct>       <fct>    
##  1   29821    13           293         88 Croatian WG    <NA>        <NA>     
##  2   29822    16           122         12 Croatian WG    <NA>        <NA>     
##  3   29823     9             3          0 Croatian WG    <NA>        <NA>     
##  4   29824    12             0          0 Croatian WG    <NA>        <NA>     
##  5   29825    12            44          0 Croatian WG    <NA>        <NA>     
##  6   29826     8            14          5 Croatian WG    <NA>        <NA>     
##  7   29827     9             2          1 Croatian WG    <NA>        <NA>     
##  8   29828    10            44          1 Croatian WG    <NA>        <NA>     
##  9   29829    13           172         51 Croatian WG    <NA>        <NA>     
## 10   29830    16           241         68 Croatian WG    <NA>        <NA>     
## # … with 82,045 more rows, and 7 more variables: sex <fct>, zygosity <chr>,
## #   norming <lgl>, mom_ed <fct>, longitudinal <lgl>, source_name <chr>,
## #   license <chr>

Items

The get_item_data() function gives by-item information, either for a specific language and/or form or for all instruments.

get_item_data(language = "Italian", form = "WG")
## # A tibble: 505 x 11
##    item_id definition language form  type  category lexical_category
##    <chr>   <chr>      <chr>    <chr> <chr> <chr>    <chr>           
##  1 item_1  Risponde … Italian  WG    firs… <NA>     <NA>            
##  2 item_2  Risponde … Italian  WG    firs… <NA>     <NA>            
##  3 item_3  Reagisce … Italian  WG    firs… <NA>     <NA>            
##  4 item_4  Vuoi la p… Italian  WG    phra… <NA>     <NA>            
##  5 item_5  Hai sonno… Italian  WG    phra… <NA>     <NA>            
##  6 item_6  Vuoi bere? Italian  WG    phra… <NA>     <NA>            
##  7 item_7  Stai atte… Italian  WG    phra… <NA>     <NA>            
##  8 item_8  Stai buono Italian  WG    phra… <NA>     <NA>            
##  9 item_9  Batti le … Italian  WG    phra… <NA>     <NA>            
## 10 item_10 Cambiamo … Italian  WG    phra… <NA>     <NA>            
## # … with 495 more rows, and 4 more variables: lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>
get_item_data()
## # A tibble: 31,811 x 11
##    item_id definition language form  type  category lexical_category
##    <chr>   <chr>      <chr>    <chr> <chr> <chr>    <chr>           
##  1 item_81 gristi     Croatian WG    word  action_… predicates      
##  2 item_2… puhati     Croatian WG    word  action_… predicates      
##  3 item_2… razbiti    Croatian WG    word  action_… predicates      
##  4 item_64 donijeti   Croatian WG    word  action_… predicates      
##  5 item_1… kupiti     Croatian WG    word  action_… predicates      
##  6 item_36 čistiti    Croatian WG    word  action_… predicates      
##  7 item_3… zatvoriti  Croatian WG    word  action_… predicates      
##  8 item_2… plakati    Croatian WG    word  action_… predicates      
##  9 item_2… plesati    Croatian WG    word  action_… predicates      
## 10 item_42 crtati     Croatian WG    word  action_… predicates      
## # … with 31,801 more rows, and 4 more variables: lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>

Administrations x Items

If you are only looking at total vocabulary size, the by-administration data is all you need, since it already includes both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data using the get_instrument_data() function. Pass it an instrument language and form, along with a vector of the items you want to extract (by item_id).

get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46")
)
## # A tibble: 11,692 x 3
##    data_id value      num_item_id
##      <dbl> <chr>            <dbl>
##  1  129242 "produces"          26
##  2  129243 "produces"          26
##  3  129244 "produces"          26
##  4  129245 "produces"          26
##  5  129246 ""                  26
##  6  129247 "produces"          26
##  7  129248 "produces"          26
##  8  129249 "produces"          26
##  9  129250 "produces"          26
## 10  129251 ""                  26
## # … with 11,682 more rows

By default get_instrument_data() returns a data frame with columns for the administration’s data_id, the item’s num_item_id (numerical item_id), and the corresponding value. To include administration information, you can set the administrations argument to TRUE, or pass the result of get_administration_data() as administrations (which prevents the administration data from being loaded multiple times). Similarly, you can set the iteminfo argument to TRUE, or pass it the result of get_item_data().

Loading the data is fast if you need only a handful of items, but the time scales about linearly with the number of items, and can get quite slow if you need many or all of them. So, it’s a good idea to filter down to only the items you need before calling get_instrument_data().
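
One way to reuse already-loaded tables across several calls is sketched below. The wrapper load_two_subsets() and its argument names are hypothetical and not part of wordbankr; it only relies on the administrations and iteminfo arguments described above accepting the results of get_administration_data() and get_item_data(). Defining the function needs nothing beyond base R, but calling it requires wordbankr and a working connection to the Wordbank database.

```r
# Hypothetical wrapper, not part of wordbankr: fetch two item subsets while
# loading the administration and item tables only once.
load_two_subsets <- function(language, form, first_ids, second_ids) {
  admins <- wordbankr::get_administration_data(language = language, form = form)
  items  <- wordbankr::get_item_data(language = language, form = form)

  # Both calls reuse the tables loaded above instead of querying them again.
  fetch <- function(ids) {
    wordbankr::get_instrument_data(language = language, form = form, items = ids,
                                   administrations = admins, iteminfo = items)
  }
  list(first = fetch(first_ids), second = fetch(second_ids))
}

# Usage (requires wordbankr and a database connection):
# subsets <- load_two_subsets("English (American)", "WS",
#                             first_ids = c("item_26", "item_46"),
#                             second_ids = c("item_1", "item_42"))
```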

As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:

library(dplyr)  # provides %>% and filter()

animals <- get_item_data(language = "English (American)", form = "WS") %>%
  filter(category == "animals")

Then we get the instrument data for those items:

animal_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = animals$item_id,
                                   administrations = TRUE)

Finally, we calculate how many animal words each child produces and the median number of animal words at each age:

animal_summary <- animal_data %>%
  mutate(produces = value == "produces") %>%
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))
  
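library(ggplot2)  # provides ggplot(), aes(), geom_point(), labs()

# Plot the median number of animal words produced at each age.
ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point() +
  labs(x = "Age (months)", y = "Median animal words produced")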
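
The two-step aggregation can be checked without querying Wordbank. The base-R sketch below uses a small invented data frame (toy is hypothetical, not real Wordbank data) with the same columns as the instrument data. It counts the produced words per child and then takes the median of those counts within each age.

```r
# Toy stand-in for the output of get_instrument_data(..., administrations = TRUE).
# Three children, three items each; all values are invented for illustration.
toy <- data.frame(
  data_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  age     = c(18, 18, 18, 18, 18, 18, 24, 24, 24),
  value   = c("produces", "", "",
              "produces", "produces", "produces",
              "produces", "produces", "produces"),
  stringsAsFactors = FALSE
)

# Step 1: number of words each child produces.
toy$produces <- toy$value == "produces"
per_child <- aggregate(produces ~ age + data_id, data = toy, FUN = sum)
names(per_child)[names(per_child) == "produces"] <- "num_animals"

# Step 2: median of those per-child counts within each age.
toy_summary <- aggregate(num_animals ~ age, data = per_child, FUN = median)
names(toy_summary)[names(toy_summary) == "num_animals"] <- "median_num_animals"
toy_summary
```

Grouping by child first matters: taking the median directly over rows would summarise individual item responses rather than per-child vocabulary counts.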

Metadata

Instruments

The get_instruments() function gives information on all the CDI instruments in Wordbank.

get_instruments()
## # A tibble: 56 x 7
##    instrument_id language     form  age_min age_max has_grammar unilemma_covera…
##            <int> <chr>        <chr>   <int>   <int>       <int>            <dbl>
##  1             1 British Sig… WG          8      36           0            0.76 
##  2             2 Cantonese    WS         16      30           0            0.95 
##  3             3 Croatian     WG          8      16           0            1    
##  4             4 Croatian     WS         16      30           0            0.52 
##  5             5 Danish       WS         16      36           1            0.580
##  6             6 English (Am… WG          8      18           0            1    
##  7             7 English (Am… WS         16      30           1            1    
##  8             8 German       WS         18      30           0            0.77 
##  9             9 Hebrew       WG         11      25           0            1    
## 10            10 Hebrew       WS         25      36           1            0.86 
## # … with 46 more rows

Sources

The get_sources() function gives information on all the data sources in Wordbank, either for a specific language and/or form or for all instruments. If the admin_data argument is set to TRUE, the results will also include the number of administrations in the database from that source and the minimum and maximum ages of those administrations.

get_sources(form = "WG")
## # A tibble: 29 x 9
##    source_id name  dataset instrument_lang… instrument_form contributor citation
##        <int> <chr> <chr>   <chr>            <fct>           <chr>       <chr>   
##  1         9 Marc… "Normi… English (Americ… Words & Gestur… Larry Fens… "Fenson…
##  2        10 Byers ""      English (Americ… Words & Gestur… Krista Bye… ""      
##  3        11 Thal  "13"    English (Americ… Words & Gestur… Donna Thal… "Thal, …
##  4        12 Thal  "16"    English (Americ… Words & Gestur… Donna Thal… "Thal, …
##  5        14 Marc… "Normi… Spanish (Mexica… Words & Gestur… Donna Jack… "Jackso…
##  6        18 Kris… ""      Norwegian        Words & Gestur… Hanne Simo… "Simons…
##  7        19 Kris… "longi… Norwegian        Words & Gestur… Hanne Simo… "Simons…
##  8        20 CLEX  ""      Croatian         Words & Gestur… Melita Kov… "Kovace…
##  9        24 CLEX  ""      Russian          Words & Gestur… Stella Cey… "Е.А.Ве…
## 10        26 CLEX  ""      Swedish          Words & Gestur… Mårten Eri… "Erikss…
## # … with 19 more rows, and 2 more variables: longitudinal <lgl>, license <fct>
get_sources(language = "Spanish (Mexican)", admin_data = TRUE) %>%
  select(source_id, name, dataset, instrument_form, n_admins, age_min, age_max)
## # A tibble: 4 x 7
##   source_id name     dataset  instrument_form   n_admins age_min age_max
##       <int> <chr>    <chr>    <fct>                <int>   <int>   <int>
## 1        13 Marchman Norming  Words & Sentences     1094      15      30
## 2        14 Marchman Norming  Words & Gestures       778       8      19
## 3        65 Fernald  Outreach Words & Gestures        55      16      22
## 4        66 Fernald  Outreach Words & Sentences       80      18      38

Advanced functionality: Age of acquisition

The fit_aoa() function computes estimates of items’ age of acquisition (AoA). It takes a data frame returned by get_instrument_data(), with one row per administration x item combination and, at minimum, the columns age and num_item_id. It returns a data frame with one row per item and an aoa column containing the estimate, preserving any item-level columns in the input data. The AoA is estimated by computing the proportion of administrations for which the child understands or produces (measure) each word, smoothing that proportion using method, and taking the age at which the smoothed value is greater than proportion.

eng_ws_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = c("item_1", "item_42"),
                                   administrations = TRUE,
                                   iteminfo = TRUE)
fit_aoa(eng_ws_data)
## # A tibble: 2 x 10
## # Groups:   num_item_id [2]
##   num_item_id   aoa item_id definition type  category lexical_category
##         <dbl> <dbl> <chr>   <chr>      <chr> <chr>    <chr>           
## 1           1    NA item_1  baa baa    word  sounds   other           
## 2          42    24 item_42 owl        word  animals  nouns           
## # … with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>
fit_aoa(eng_ws_data, measure = "understands", method = "glmrob", proportion = 0.7)
## # A tibble: 2 x 10
## # Groups:   num_item_id [2]
##   num_item_id   aoa item_id definition type  category lexical_category
##         <dbl> <dbl> <chr>   <chr>      <chr> <chr>    <chr>           
## 1           1    21 item_1  baa baa    word  sounds   other           
## 2          42    27 item_42 owl        word  animals  nouns           
## # … with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>
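
To make the estimation idea concrete, here is a minimal base-R sketch on simulated data. It is not the package's implementation: the helper estimate_aoa() is hypothetical, the smoother is assumed to be an ordinary logistic regression fitted with glm(), and returning NA when the threshold is never reached within the observed age range is a choice made for this sketch.

```r
# Simulated stand-in for one item's data: one row per administration.
# The true production probability crosses 0.5 at 22.5 months (invented values).
set.seed(1)
sim <- data.frame(age = rep(16:30, each = 200))
sim$produces <- rbinom(nrow(sim), size = 1,
                       prob = plogis(0.35 * (sim$age - 22.5)))

# Hypothetical helper: smooth the proportion with a logistic regression and
# return the earliest whole-month age whose fitted value reaches `proportion`.
estimate_aoa <- function(data, proportion = 0.5) {
  fit <- glm(produces ~ age, family = binomial, data = data)
  ages <- seq(min(data$age), max(data$age))
  fitted_prop <- predict(fit, newdata = data.frame(age = ages),
                         type = "response")
  reached <- ages[fitted_prop >= proportion]
  if (length(reached) == 0) NA_integer_ else min(reached)
}

estimate_aoa(sim)                     # threshold of 0.5
estimate_aoa(sim, proportion = 0.99)  # threshold not reached by 30 months here
```

Raising proportion pushes the estimate later, and a threshold that the fitted curve never reaches in the observed ages yields a missing value in this sketch.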

Advanced functionality: Cross-linguistic data

One of the item-level fields is uni_lemma (“universal lemma”), which is intended to be an approximate semantic mapping between words across the languages in Wordbank. The function get_crossling_items() simply gives all the available uni_lemma values.

get_crossling_items()
## # A tibble: 1,380 x 1
##    uni_lemma      
##    <chr>          
##  1 a              
##  2 a little       
##  3 a lot          
##  4 able           
##  5 about          
##  6 above          
##  7 after          
##  8 afternoon      
##  9 again          
## 10 air conditioner
## # … with 1,370 more rows

The function get_crossling_data() takes a vector of uni_lemmas and returns a data frame of summary statistics for each item mapped to that uni_lemma in any language (on WG forms). Each row is a combination of item and age, and the columns indicate the number of children (n_children), means (comprehension, production), standard deviations (comprehension_sd, production_sd), and item-level fields.

get_crossling_data(uni_lemmas = c("hat", "nose")) %>%
  ungroup() %>%
  select(language, uni_lemma, definition, age, n_children, comprehension,
         production, comprehension_sd, production_sd) %>%
  arrange(uni_lemma)
## # A tibble: 381 x 9
##    language uni_lemma definition   age n_children comprehension production
##    <chr>    <chr>     <chr>      <int>      <int>         <dbl>      <dbl>
##  1 British… hat       hat            8          4         0          0    
##  2 British… hat       hat            9          4         0          0    
##  3 British… hat       hat           10          4         0          0    
##  4 British… hat       hat           11          6         0.167      0    
##  5 British… hat       hat           12          6         0          0    
##  6 British… hat       hat           13          6         0          0    
##  7 British… hat       hat           14          7         0.143      0    
##  8 British… hat       hat           15          6         0          0    
##  9 British… hat       hat           16          7         0.143      0.143
## 10 British… hat       hat           17          7         0.286      0.143
## # … with 371 more rows, and 2 more variables: comprehension_sd <dbl>,
## #   production_sd <dbl>
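
The shape of these summary statistics can be illustrated with base R on invented data. The sketch below assumes that the means and standard deviations are taken over a 0/1 indicator per child; that is an assumption about how the columns are derived, not something the output above confirms. The data frame toy and its column names are hypothetical.

```r
# Invented responses for a single item: 1 = yes, 0 = no (not real Wordbank data).
toy <- data.frame(
  age         = c(12, 12, 12, 12, 13, 13, 13, 13),
  understands = c(1, 0, 0, 1, 1, 1, 0, 1),
  produces    = c(0, 0, 0, 1, 0, 1, 0, 1)
)

# One summary row per age: child count, mean and SD of each indicator.
toy_stats <- do.call(rbind, lapply(split(toy, toy$age), function(d) {
  data.frame(
    age              = d$age[1],
    n_children       = nrow(d),
    comprehension    = mean(d$understands),
    production       = mean(d$produces),
    comprehension_sd = sd(d$understands),
    production_sd    = sd(d$produces)
  )
}))
rownames(toy_stats) <- NULL
toy_stats
```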

Advanced functionality: Vocabulary quantiles

The function fit_vocab_quantiles() uses quantile regression to fit a set of vocabulary size quantiles to a dataset. It takes a data frame returned by get_administration_data(), plus additional arguments specifying which measure column to fit on (measure: “production” or “comprehension”), an optional demographic column to group by (group), and which type of quantiles to fit (quantiles: “standard”, “deciles”, “quintiles”, “quartiles”, “median”, or a numeric vector of quantile values). The quantiles argument defaults to “standard”, which is 0.10, 0.25, 0.50, 0.75, 0.90.

eng_ws <- get_administration_data("English (American)", "WS")
fit_vocab_quantiles(eng_ws, production)
## # A tibble: 75 x 5
## # Groups:   language, form [1]
##    language           form    age quantile production
##    <chr>              <chr> <int> <fct>         <dbl>
##  1 English (American) WS       16 0.1            8.84
##  2 English (American) WS       17 0.1           10.4 
##  3 English (American) WS       18 0.1           14.0 
##  4 English (American) WS       19 0.1           19.5 
##  5 English (American) WS       20 0.1           27.0 
##  6 English (American) WS       21 0.1           36.5 
##  7 English (American) WS       22 0.1           49.9 
##  8 English (American) WS       23 0.1           67.7 
##  9 English (American) WS       24 0.1           90.0 
## 10 English (American) WS       25 0.1          117.  
## # … with 65 more rows
fit_vocab_quantiles(eng_ws, production, sex)
## # A tibble: 150 x 6
## # Groups:   language, form, sex [2]
##    language           form  sex      age quantile production
##    <chr>              <chr> <fct>  <int> <fct>         <dbl>
##  1 English (American) WS    Female    16 0.1            8.06
##  2 English (American) WS    Female    17 0.1           10.6 
##  3 English (American) WS    Female    18 0.1           16.2 
##  4 English (American) WS    Female    19 0.1           25.  
##  5 English (American) WS    Female    20 0.1           36.9 
##  6 English (American) WS    Female    21 0.1           51.9 
##  7 English (American) WS    Female    22 0.1           70.8 
##  8 English (American) WS    Female    23 0.1           93.8 
##  9 English (American) WS    Female    24 0.1          121.  
## 10 English (American) WS    Female    25 0.1          152.  
## # … with 140 more rows
fit_vocab_quantiles(eng_ws, production, quantiles = "quartiles")
## # A tibble: 45 x 5
## # Groups:   language, form [1]
##    language           form    age quantile production
##    <chr>              <chr> <int> <fct>         <dbl>
##  1 English (American) WS       16 0.25           18.0
##  2 English (American) WS       17 0.25           21.8
##  3 English (American) WS       18 0.25           30.1
##  4 English (American) WS       19 0.25           43.0
##  5 English (American) WS       20 0.25           60.6
##  6 English (American) WS       21 0.25           82.7
##  7 English (American) WS       22 0.25          109. 
##  8 English (American) WS       23 0.25          141. 
##  9 English (American) WS       24 0.25          177. 
## 10 English (American) WS       25 0.25          217. 
## # … with 35 more rows
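
For intuition about the output layout (one row per age and quantile), the sketch below computes plain empirical quantiles within each age on simulated scores. This is a deliberately simpler stand-in, not quantile regression and not what fit_vocab_quantiles() does internally. The data are invented and the object names are hypothetical.

```r
# Simulated production scores that grow with age (invented, not Wordbank data).
set.seed(42)
sim <- data.frame(age = rep(16:30, each = 100))
sim$production <- rpois(nrow(sim), lambda = 10 * (sim$age - 15))

# The "standard" quantiles listed in the text above.
probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)

# Empirical quantiles within each age: no smoothing across ages is applied,
# unlike a quantile regression fit.
q_by_age <- do.call(rbind, lapply(split(sim, sim$age), function(d) {
  data.frame(
    age        = d$age[1],
    quantile   = factor(probs),
    production = as.numeric(quantile(d$production, probs = probs))
  )
}))
rownames(q_by_age) <- NULL
head(q_by_age)
```

With 15 ages and 5 quantiles, the result has 75 rows, the same layout as the first fit_vocab_quantiles() output above. The values differ because no model is fitted across ages.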