Value Labels

Minnesota Population Center

2020-07-21

Value labels in the ipumsr package

Integrated variables in IPUMS data often have value labels, which attach text labels to the values taken by a variable (for example the HEALTH variable has value labels: 1 = “Excellent”, 2 = “Very good”, etc.). The ipumsr package does import the labels, but not as factors, which may be how you were expecting them.

library(ipumsr)
ddi <- read_ipums_ddi(ipums_example("cps_00015.xml"))
cps <- read_ipums_micro(ddi, verbose = FALSE)

cps
#> # A tibble: 10,883 x 13
#>     YEAR SERIAL HWTSUPP STATEFIP ASECFLAG   MONTH PERNUM WTSUPP   AGE      EDUC
#>    <dbl>  <dbl>   <dbl> <int+lb> <int+lb> <int+l>  <dbl>  <dbl> <int> <int+lbl>
#>  1  2016  24138   3249. 55 [Wis~ 1 [ASEC] 3 [Mar~      1  3249.    54  73 [Hig~
#>  2  2016  24139   3154. 55 [Wis~ 1 [ASEC] 3 [Mar~      1  3154.    54  73 [Hig~
#>  3  2016  24139   3154. 55 [Wis~ 1 [ASEC] 3 [Mar~      2  3154.    52  73 [Hig~
#>  4  2016  24140   1652. 55 [Wis~ 1 [ASEC] 3 [Mar~      1  1652.    38  60 [Gra~
#>  5  2016  24140   1652. 55 [Wis~ 1 [ASEC] 3 [Mar~      2  1503.    15  10 [Gra~
#>  6  2016  24140   1652. 55 [Wis~ 1 [ASEC] 3 [Mar~      3  1652.    38  73 [Hig~
#>  7  2016  24141   3049. 55 [Wis~ 1 [ASEC] 3 [Mar~      1  3049.    85  30 [Gra~
#>  8  2016  24142   1637. 55 [Wis~ 1 [ASEC] 3 [Mar~      1  1637.    27 111 [Bac~
#>  9  2016  24142   1637. 55 [Wis~ 1 [ASEC] 3 [Mar~      2  1637.    27 111 [Bac~
#> 10  2016  24142   1637. 55 [Wis~ 1 [ASEC] 3 [Mar~      3  1887.     2   1 [NIU~
#> # ... with 10,873 more rows, and 3 more variables: INCTOT <dbl+lbl>,
#> #   HEALTH <int+lbl>, MIGRATE1 <int+lbl>

The first clue that some of the variables are labelled is the <dbl+lbl> that appear below STATEFIP, ASECFLAG and other variables. The tibble package prints the variable’s type information below the variable name, and this “+lbl” indicates that the variable uses the labelled type. The function is.labelled() will also tell you if a variable is labelled.

By default the ipumsr package now will attempt to show the value labels when printing out a data.frame (like the "[Wis~" on STATEFIP above, for Wisconsin). If your console supports colorful output, then the labels will be in a lighter gray than the values. You can turn off this printing behavior by setting the option options("ipumsr.show_pillar_labels" = FALSE))

However, as you can see, there often isn’t enough space to see the labels when printing out a full data.frame. So, there are a few options to see the labels directly:

# Printing the variable directly (or a subset)
head(cps$MONTH)
#> <labelled<integer>[6]>: Month
#> [1] 3 3 3 3 3 3
#> 
#> Labels:
#>  value     label
#>      1   January
#>      2  February
#>      3     March
#>      4     April
#>      5       May
#>      6      June
#>      7      July
#>      8    August
#>      9 September
#>     10   October
#>     11  November
#>     12  December

# Just get the labels
ipums_val_labels(cps$MONTH)
#> # A tibble: 12 x 2
#>      val lbl      
#>    <int> <chr>    
#>  1     1 January  
#>  2     2 February 
#>  3     3 March    
#>  4     4 April    
#>  5     5 May      
#>  6     6 June     
#>  7     7 July     
#>  8     8 August   
#>  9     9 September
#> 10    10 October  
#> 11    11 November 
#> 12    12 December

# or if you're working interactively you can use ipums_view
# ipums_view(ddi)

Why use the labelled class instead of base R’s factors?

The usual way to connect numeric data to labels in R is in factor variables. Though this data type is more native to R, and more widely supported by R code, it was designed for efficient calculations in linear models, not as a general purpose value labeling system and so is missing important features that the value labels provided by IPUMS require.

Factors only allow for integers to be mapped to a text label, and these integers have to be a count starting at 1. This doesn’t work for IPUMS data because often our variables have specific meanings for the codes. For example, the variable AGE uses the value to mean the actual age, but does have labels for age 0 and the top codes.

head(cps$AGE)
#> <labelled<integer>[6]>: Age
#> [1] 54 54 52 38 15 38
#> 
#> Labels:
#>  value               label
#>      0        Under 1 year
#>     90 90 (90+, 1988-2002)
#>     99                 99+

cps$AGE_FACTOR <- as_factor(cps$AGE)
head(cps$AGE_FACTOR)
#> [1] 54 54 52 38 15 38
#> 84 Levels: Under 1 year 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... 99+

It may seem like the new AGE_FACTOR variable is okay, but it can be very confusing!

mean(cps$AGE)
#> [1] 35.0226

# mean(cps$AGE_FACTOR) # error because data is a factor not numeric

mean(as.numeric(cps$AGE_FACTOR)) # A common mistake
#> [1] 35.94836

# The "more" correct way, but NA because of the text labels
mean(as.numeric(as.character(cps$AGE_FACTOR)))
#> Warning in mean(as.numeric(as.character(cps$AGE_FACTOR))): NAs introduced by
#> coercion
#> [1] NA

Because the factor variable has to assign values starting at 1, but the AGE variable started at 0, most values were 1 higher than they should have been. Not all values are 1 higher though, because not all values exist in the data, so 85, 90, and 99 are 82, 83 and 84 respectively.

Other variables have special meanings behind certain codes. For example, often missing or NIU values are indicated in IPUMS by values starting with the number 9 that are offset from the typical values. R’s factors do not allow for this separation, so the missing codes will be harder to distinguish.

Factors also require that every value be labelled, which is not always true in IPUMS data. In the AGE variable, the only values with labels are 0, 90 and 99. For all other values, there is not additional label information.

Is the labelled class a panacea? (Hint: no)

Though the labelled class does express all of the meaning provided by IPUMS value labels into R, many R functions cannot use them or even actively remove them from the data.

ipums_val_labels(cps$HEALTH)
#> # A tibble: 5 x 2
#>     val lbl      
#>   <int> <chr>    
#> 1     1 Excellent
#> 2     2 Very good
#> 3     3 Good     
#> 4     4 Fair     
#> 5     5 Poor

HEALTH2 <- ifelse(cps$HEALTH > 3, 3, cps$HEALTH)
ipums_val_labels(HEALTH2)
#> # A tibble: 0 x 2
#> # ... with 2 variables: val <dbl>, lbl <chr>

Therefore, your first task when importing an IPUMS data set will usually be to convert the labelled values to other data structures. The bad news is that there’s no good automatic way to do this; a lot depends on how you plan to use the variables in your analysis and your preferences.

The good news is that the ipumsr package provides several functions to make this process easier.

I think it is easiest to learn them by seeing them in action, so see below for a workflow for bringing in the CPS example extract. For your reference, here is a list of the functions:

Workflow example using ipumsr functions

as_factor()

The HEALTH variable is structured just like a factor and so can be converted directly. The as_factor() function is the easiest way to do so.

ipums_val_labels(cps$HEALTH)
#> # A tibble: 5 x 2
#>     val lbl      
#>   <int> <chr>    
#> 1     1 Excellent
#> 2     2 Very good
#> 3     3 Good     
#> 4     4 Fair     
#> 5     5 Poor
cps$HEALTH <- as_factor(cps$HEALTH)

The ASECFLAG and MONTH variables can also be converted directly (these were included by default by the IPUMS extract engine, but aren’t useful here because this data set only has respondents from ASEC in March)

cps$ASECFLAG <- as_factor(cps$ASECFLAG)
cps$MONTH <- as_factor(cps$MONTH)

as_factor works on data.frames by converting every labelled variable to a factor. However, this can create confusing variables like AGE_FACTOR from above, so this isn’t the best thing to do right away.

zap_labels()

I may decide that for my analysis, the AGE variable is most useful as the numeric values. The zap_labels() function removes the labels.

cps$AGE <- zap_labels(cps$AGE)

The top-codes (which are only available on the CPS website) indicate that the value 80 actually indicates 80-84 and 85 indicates 85+, so another option would be to convert them to a factor with age ranges.

lbl_clean()

This extract only contains data from a few states and I don’t want the factor to have a level for the unused ones. The lbl_clean() function keeps only labels for values in the current data set and returns a labelled variable which we can convert to a factor.

ipums_val_labels(cps$STATEFIP)
#> # A tibble: 75 x 2
#>      val lbl                 
#>    <int> <chr>               
#>  1     1 Alabama             
#>  2     2 Alaska              
#>  3     4 Arizona             
#>  4     5 Arkansas            
#>  5     6 California          
#>  6     8 Colorado            
#>  7     9 Connecticut         
#>  8    10 Delaware            
#>  9    11 District of Columbia
#> 10    12 Florida             
#> # ... with 65 more rows
cps$STATEFIP <- lbl_clean(cps$STATEFIP)

ipums_val_labels(cps$STATEFIP)
#> # A tibble: 5 x 2
#>     val lbl         
#>   <int> <chr>       
#> 1    19 Iowa        
#> 2    27 Minnesota   
#> 3    38 North Dakota
#> 4    46 South Dakota
#> 5    55 Wisconsin
cps$STATEFIP <- as_factor(cps$STATEFIP)

lbl_na_if()

The INCTOT variable has 2 labelled values that are not actually incomes: 99999998 indicates “Missing” and 99999999 indicate “Not in Universe”. On the CPS website, the Universe tab indicates that the Universe for 2016 is respondents age 15+. Let’s say for my analysis, I can treat these values as missing.

The lbl_na_if() function takes the variable and a function that refers to .val and .lbl (the values and labels respectively) and returns an indicator of whether to set those values to NA and remove the label. You can also use the ~ notation from the purrr package to create succinct anonymous functions.

The .val and .lbl only refer to values that already have labels, they do not apply to unlabeled values. See the lbl_add() and lbl_add_vals() functions below for working with unlabeled values.

# Caution: R defaults to printing large numbers like 99999999 in rounded 
# exponential format (1e+08) but that's not how they are actually stored
ipums_val_labels(cps$INCTOT)
#> # A tibble: 2 x 2
#>        val lbl                      
#>      <dbl> <chr>                    
#> 1 99999998 Missing.                 
#> 2 99999999 N.I.U. (Not in Universe).

# All of these are equivalent
INCTOT1 <- lbl_na_if(cps$INCTOT, ~.val >= 99999990)
INCTOT2 <- lbl_na_if(cps$INCTOT, ~.lbl %in% c("Missing.", "N.I.U. (Not in Universe)."))
INCTOT3 <- lbl_na_if(cps$INCTOT, function(.val, .lbl) {
  is_missing <- .val == 99999998
  is_niu <- .lbl == "N.I.U. (Not in Universe)."
  return(is_missing | is_niu)
})

# Change to a factor in the original cps data.frame
cps$INCTOT <- lbl_na_if(cps$INCTOT, ~.val >= 9999990)
cps$INCTOT <- as_factor(cps$INCTOT)

lbl_collapse()

The EDUC variable provides an example of a common IPUMS practice of grouping categories together by the starting digits. For example, the value 10 indicates “Grades 1, 2, 3, or 4”, and 11 - “Grade 1”, 12 - “Grade 2”, etc.

Let’s say that I only care about those more general categories provided by the first 2 digits. The lbl_collapse() function allows me to provide a function that takes .val and .lbl and returns the value to assign it to. If that code is already used, then the all of the values assigned to it will get that label, otherwise the label of the smallest value is used. Just like with lbl_na_if(), the purrr-style compact syntax using ~ functions is supported.

ipums_val_labels(cps$EDUC)
#> # A tibble: 36 x 2
#>      val lbl                 
#>    <int> <chr>               
#>  1     0 NIU or no schooling 
#>  2     1 NIU or blank        
#>  3     2 None or preschool   
#>  4    10 Grades 1, 2, 3, or 4
#>  5    11 Grade 1             
#>  6    12 Grade 2             
#>  7    13 Grade 3             
#>  8    14 Grade 4             
#>  9    20 Grades 5 or 6       
#> 10    21 Grade 5             
#> # ... with 26 more rows
# %/% is integer division, which divides by the number but doesn't keep the remainder
cps$EDUC <- lbl_collapse(cps$EDUC, ~.val %/% 10)

ipums_val_labels(cps$EDUC)
#> # A tibble: 14 x 2
#>      val lbl                 
#>    <dbl> <chr>               
#>  1     0 NIU or no schooling 
#>  2     1 Grades 1, 2, 3, or 4
#>  3     2 Grades 5 or 6       
#>  4     3 Grades 7 or 8       
#>  5     4 Grade 9             
#>  6     5 Grade 10            
#>  7     6 Grade 11            
#>  8     7 Grade 12            
#>  9     8 1 year of college   
#> 10     9 2 years of college  
#> 11    10 3 years of college  
#> 12    11 4 years of college  
#> 13    12 5+ years of college 
#> 14    99 Missing/Unknown

cps$EDUC <- as_factor(cps$EDUC)

lbl_relabel()

Sometimes you may wish to move the labels into new categories. For example, the categories in MIGRATE1 may not quite map what I want to use in my analysis.

The lbl_relabel() function provides a more flexible way to group existing labelled values into new ones. It takes a two-sided formula, where the left-hand side is a label (defined with the lbl() function) and the right hand side is an expression that can use .val and .lbl to evaluate to a logical indicating which values should be assigned to this label.

ipums_val_labels(cps$MIGRATE1)
#> # A tibble: 8 x 2
#>     val lbl                                 
#>   <int> <chr>                               
#> 1     0 NIU                                 
#> 2     1 Same house                          
#> 3     2 Different house, place not reported 
#> 4     3 Moved within county                 
#> 5     4 Moved within state, different county
#> 6     5 Moved between states                
#> 7     6 Abroad                              
#> 8     9 Unknown

cps$MIGRATE1 <- lbl_relabel(
  cps$MIGRATE1,
  lbl(0, "NIU / Missing / Unknown") ~ .val %in% c(0, 2, 9),
  lbl(1, "Stayed in state") ~ .val %in% c(1, 3, 4)
)

ipums_val_labels(cps$MIGRATE1)
#> # A tibble: 4 x 2
#>     val lbl                    
#>   <dbl> <chr>                  
#> 1     0 NIU / Missing / Unknown
#> 2     1 Stayed in state        
#> 3     5 Moved between states   
#> 4     6 Abroad

cps$MIGRATE1 <- as_factor(cps$MIGRATE1)

lbl_add() and lbl_add_vals()

These functions allow you to create labels for values that aren’t already labelled. It’s harder to come up with real world examples of when these functions would be useful, but just in case you come across such a situation, here’s how they work.

x <- haven::labelled(
  c(100, 200, 105, 990, 999, 230),
  c(`Unknown` = 990, NIU = 999)
)

lbl_add(x, lbl(100, "$100"), lbl(105, "$105"), lbl(200, "$200"), lbl(230, "$230"))
#> <labelled<double>[6]>
#> [1] 100 200 105 990 999 230
#> 
#> Labels:
#>  value   label
#>    100    $100
#>    105    $105
#>    200    $200
#>    230    $230
#>    990 Unknown
#>    999     NIU
lbl_add_vals(x, ~paste0("$", .))
#> <labelled<double>[6]>
#> [1] 100 200 105 990 999 230
#> 
#> Labels:
#>  value   label
#>    100    $100
#>    105    $105
#>    200    $200
#>    230    $230
#>    990 Unknown
#>    999     NIU

Ready for analysis

And now, after converting all those labels to factors, I’m ready for analysis! If you think of any other helper functions that would be useful for dealing with labels please let us know by filing an issue on github.

More detail on how the lbl_* functions work

One implementation detail that may help you understand the lbl_* functions better is that the value labels are stored separately from the actual data. This can be important because it allows for values to exist in the data without labels (such as the non-special codes in the INCTOT variable from the example above) and also for value labels to exist even if they don’t exist in the data (like the STATEFIPS that we didn’t include in our extract).

The .val and .lbl pronouns that are usable in functions like lbl_na_if(), lbl_collapse() and lbl_relabel() only include these labelled values, not all values in the dataset. Though this makes many calculations simpler, because only the labelled values are considered, it can be confusing when you want to work with the unlabeled values.

For example, considering the INCTOT variable, if there are unlabeled values that you want to set to NA, you cannot use lbl_na_if() directly.

# Reload cps data so that INCTOT is a labelled class again
cps <- read_ipums_micro(ddi, verbose = FALSE)

# Try to set all values above 1000000 to NA
test1 <- lbl_na_if(cps$INCTOT, ~.val > 1000000)
test1 <- zap_labels(test1)
max(test1, na.rm = TRUE)
#> [1] 1230006
# Didn't work

Instead you should add the value labels with lbl_add_vals() (or you could use a function that doesn’t use the labels, such as dplyr::na_if())

test2 <- lbl_add_vals(cps$INCTOT)
test2 <- lbl_na_if(test2, ~.val > 1000000)
test2 <- zap_labels(test2)
max(test2, na.rm = TRUE)
#> [1] 990003

Other resources

The haven package vignette ‘semantics’ has some more details about the motivation and implementation of the labelled class. You can view it by running the command: vignette("semantics", package = "haven")

The labelled package provides other methods for manipulating value labels. It is not installed by ipumsr, but is available on CRAN via the following command: install.packages("labelled")

The questionr package includes great functions for exploring labelled variables. In particular, the functions describe, freq and lookfor all print out to console information about the variable using the value labels. It is also not installed by ipumsr, but can be installed from CRAN using: install.packages("questionr")

Finally, the foreign and prettyR packages don’t use the labelled class data structure from haven (which ipumsr uses), but do have very similar concepts for attaching value labels. Code designed for these packages could be adapted for use with the haven labelled class without too much difficulty.