Introduction to the surveydata package.

Andrie de Vries

2020-09-15

The surveydata package makes it easy to work with typical survey data that originated in SPSS or other formats.

Motivation

Specifically, the package makes it easy to include the question text as metadata with the data itself.

To track the questions of a survey, you have two options:

Neither of these options are ideal, since any subsetting of the survey data means you must keep track of the question metadata separately.

This package solves the problem by creating a new class, surveydata, and keeping the questions as an attribute of this class. Whenever you do a subset operation, the metadata stays intact.

In addition, the metadata knows if a question consists of a single column, or multiple columns. When creating a subset on the question name, the resulting object can be either a single column or multiple columns.

library(surveydata)
library(dplyr)
sv <- membersurvey %>% as.tbl()
## Warning: `as.tbl()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: 'as.tbl.surveydata' is deprecated.
## Use 'as_tibble' instead.
## See help("Deprecated")
sv
## # A tibble: 215 x 109
##       id  Q1_1  Q1_2 Q2    Q3_1  Q3_2  Q3_3  Q3_4  Q3_5  Q3_6  Q3_7  Q3_8  Q3_9 
##    <dbl> <dbl> <dbl> <ord> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
##  1     3     8   2   2009  No    No    No    No    No    No    No    No    No   
##  2     5    35  12   Befo~ Yes   No    No    No    No    No    No    No    Yes  
##  3     6    34  12   Befo~ Yes   Yes   No    No    No    Yes   No    No    No   
##  4    11    20   9   2010  No    No    No    No    No    No    No    No    No   
##  5    13    20   3   2010  No    No    No    No    No    No    No    No    No   
##  6    15    36  20   Befo~ No    Yes   No    No    No    No    No    No    Yes  
##  7    21    12   2.5 2009  Yes   No    No    No    No    Yes   Yes   No    No   
##  8    22    11   0.5 2011  Yes   Yes   Yes   Yes   Yes   No    No    No    No   
##  9    23    18   3   2008  Yes   Yes   Yes   Yes   Yes   Yes   No    No    Yes  
## 10    25    24   8   2006  No    No    No    Yes   Yes   Yes   No    No    Yes  
## # ... with 205 more rows, and 96 more variables: Q3_10 <fct>, Q3_11 <fct>,
...

Notice from this summary that Question 2 has two columns, i.e. Q2_1 and Q2_2. You can extract both these columns by simply referring to Q2:

sv[, "Q2"]
## # A tibble: 215 x 1
##    Q2         
##    <ord>      
##  1 2009       
##  2 Before 2002
##  3 Before 2002
##  4 2010       
##  5 2010       
##  6 Before 2002
##  7 2009       
##  8 2011       
##  9 2008       
## 10 2006       
## # ... with 205 more rows

However, the subset of Q1 returns only a single column:

sv[, "Q2"]
## # A tibble: 215 x 1
##    Q2         
##    <ord>      
##  1 2009       
##  2 Before 2002
##  3 Before 2002
##  4 2010       
##  5 2010       
##  6 Before 2002
##  7 2009       
##  8 2011       
##  9 2008       
## 10 2006       
## # ... with 205 more rows

Note that in both cases the surveydata object doesn’t return a vector - subsetting a surveydata object always returns a surveydata object.

About surveydata objects

A surveydata object consists of:

Data processing a survey file can be tricky, since the standard methods for dealing with data frames does not conserve the variable.labels attribute. The surveydata package defines a surveydata class and the following methods that knows how to deal with the variable.labels attribute:

In addition, surveydata defines the following convenient methods for extracting and working with the variable labels:

Defining a surveydata object

First load the surveydata package.

library(surveydata)

Next, create sample data. A data frame is the ideal data structure for survey data, and the convention is that data for each respondent is stored in the rows, while each column represents answers to a specific question.

sdat <- data.frame(
   id   = 1:4,
   Q1   = c("Yes", "No", "Yes", "Yes"),
   Q4_1 = c(1, 2, 1, 2), 
   Q4_2 = c(3, 4, 4, 3), 
   Q4_3 = c(5, 5, 6, 6), 
   Q10 = factor(c("Male", "Female", "Female", "Male")),
   crossbreak  = c("A", "A", "B", "B"), 
   weight      = c(0.9, 1.1, 0.8, 1.2)
)

The survey metadata consists of the questionnaire text. For example, this can be represented by a character vector, with an element for each question.

To assign this metadata to the survey data, use the varlabels() function. This function assigns the questionnaire text to the variable.labels attribute of the data frame.

varlabels(sdat) <- c(
   "RespID",
   "Question 1", 
   "Question 4: red", "Question 4: green", "Question 4: blue", 
   "Question 10",
   "crossbreak",
   "weight"
)

Finally, create the surveydata object. To do this, call the as.surveydata() function. The argument renameVarlabels controls whether the varlabels get renamed with the same names as the data. This is an essential step, and ensures that the question text remains in synch with the column names.

sv <- as.surveydata(sdat, renameVarlabels = TRUE)

Extracting specific questions

It is easy to extract specific questions with the [ operator. This works very similar to extraction of data frames. However, there are two important differences:

sv[, "Q1"]
##    Q1
## 1 Yes
## 2  No
## 3 Yes
## 4 Yes
sv[, "Q4"]
##   Q4_1 Q4_2 Q4_3
## 1    1    3    5
## 2    2    4    5
## 3    1    4    6
## 4    2    3    6

The extraction makes use of the underlying metadata, contained in the varlabels and pattern attributes:

varlabels(sv)
##                  id                  Q1                Q4_1                Q4_2 
##            "RespID"        "Question 1"   "Question 4: red" "Question 4: green" 
##                Q4_3                 Q10          crossbreak              weight 
##  "Question 4: blue"       "Question 10"        "crossbreak"            "weight"
pattern(sv)
## $sep
## [1] "_"
## 
## $exclude
## [1] "other"

Working with question columns

It is easy to query the surveydata object to find out which questions it contains, as well as which columns store the data for those questions.

questions(sv)
## [1] "id"         "Q1"         "Q4"         "Q10"        "crossbreak"
## [6] "weight"
which.q(sv, "Q1")
## [1] 2
which.q(sv, "Q4")
## [1] 3 4 5

Reading the questionnaire text

The function question_text() gives access to the questionnaire text.

question_text(sv, "Q1")
## [1] "Question 1"
question_text(sv, "Q4")
## [1] "Question 4: red"   "Question 4: green" "Question 4: blue"

Getting the common question text

Use question_text_common() to retrieve the common text, i.e. the question itself:

question_text_common(sv, "Q4")
## [1] "Question 4"

Getting the unique question text

And use question_text_unique() to retrieve the unique part of the question, i.e. the sub-questions:

question_text_unique(sv, "Q4")
## [1] "red"   "green" "blue"

Using surveydata with dplyr

The surveydata object knows how to deal with the following dplyr verbs:

In every case the resulting object will also be of class surveydata.

Summary

The surveydata object can make it much easier to work with survey data.