3. Data Download

2022-01-12

This vignette shows how to download data from Dataverse using the dataverse package. We’ll focus on a Dataverse repository that contains supplemental files for Political Analysis Using R, which is stored at Harvard’s Dataverse Server https://dataverse.harvard.edu.

The Dataverse entry for this study is persistently retrievable by a “Digital Object Identifier (DOI)”: https://doi.org/10.7910/DVN/ARKOTI and the citation on the Dataverse Page includes a “Universal Numeric Fingerprint (UNF)”: UNF:6:+itU9hcUJ8I9E0Kqv8HWHg==, which provides a versioned, multi-file hash for the entire study, which contains 32 files.

The following examples will draw from the Harvard Dataverse, so it is convenient to set this as a default environment variable.

Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

This is equivalent to setting server = "dataverse.harvard.edu" in every dataverse function each time. Note that if you set an environment variable like the above, that is necessary to make your code reproducible.

For downloading a public dataset, no API Key is needed.

Sys.setenv("DATAVERSE_KEY" = "")

Retrieving Metadata

We will download these files and examine them directly in R using the dataverse package. To begin, we need to loading the package and using the get_dataset() function to retrieve some basic metadata about the dataset:

library("dataverse")
library("tibble") # to see dataframes in tidyverse-form

The get_dataset() function lists all of the files in the dataset along with a considerable amount of metadata about each. (Recall that in Dataverse, dataset is a collection of files, not a single file.) We can see a quick glance at these files using:

dataset <- get_dataset("doi:10.7910/DVN/ARKOTI")
dataset$files[c("filename", "contentType")]

This shows that there are indeed 32 files, a mix of .R code files and tab- and comma-separated data files.

You can also retrieve more extensive metadata using dataset_metadata():

str(dataset_metadata("10.7910/DVN/ARKOTI"), 2)
## List of 3
##  $ displayName: chr "Citation Metadata"
##  $ name       : chr "citation"
##  $ fields     :'data.frame': 7 obs. of  4 variables:
##   ..$ typeName : chr [1:7] "title" "author" "datasetContact" "dsDescription" ...
##   ..$ multiple : logi [1:7] FALSE TRUE TRUE TRUE TRUE FALSE ...
##   ..$ typeClass: chr [1:7] "primitive" "compound" "compound" "compound" ...
##   ..$ value    :List of 7

Retrieving Plain-Text Data

Now we’ll get the corresponding data files. First, we retrieve a plain-text file like this dataset on electricity consumption by Wakiyama et al. (2014). Taking the file name and dataset DOI from this entry,

energy <- get_dataframe_by_name(
  "comprehensiveJapanEnergy.tab",
  "10.7910/DVN/ARKOTI")
## Downloading ingested version of data with readr::read_tsv. To download the original version and remove this message, set original = TRUE.
## Rows: 60 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): date
## dbl (9): time, dummy, temp, temp2, all, large, house, kepco, tepco
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(energy)
## # A tibble: 6 × 10
##    time date  dummy  temp temp2      all    large    house    kepco    tepco
##   <dbl> <chr> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1     1 8-Jan     0   5.9  34.8 95792389 35194957 26190714 13357735 26960899
## 2     2 8-Feb     0   5.5  30.3 95156901 35322031 24224097 13315027 27189705
## 3     3 8-Mar     0  10.7 114.  91034047 36474192 21391965 12805831 24495519
## 4     4 8-Apr     0  14.7 216.  84087552 34949622 18494473 11494328 23540356
## 5     5 8-May     0  18.5 342.  82742929 35417089 17923760 11589061 22848737
## 6     6 8-Jun     0  21.3 454.  82180013 36692291 15205229 11360771 22487441

These get_dataframe_* functions, introduced in v0.3.0, directly read in the data into a R environment through whatever R function supplied by .f. The default of the get_dataframe_* functions is to read in such data by readr::read_tsv(). The .f function can be modified to modify the read-in settings. For example, the following modification is a base-R equivalent to read in the ingested data.

library(readr)
energy <- get_dataframe_by_name(
  "comprehensiveJapanEnergy.tab",
  "10.7910/DVN/ARKOTI",
  .f = function(x) read.delim(x, sep = "\t"))

head(energy)
##   time  date dummy temp temp2      all    large    house    kepco    tepco
## 1    1 8-Jan     0  5.9  34.8 95792389 35194957 26190714 13357735 26960899
## 2    2 8-Feb     0  5.5  30.3 95156901 35322031 24224097 13315027 27189705
## 3    3 8-Mar     0 10.7 114.5 91034047 36474192 21391965 12805831 24495519
## 4    4 8-Apr     0 14.7 216.1 84087552 34949622 18494473 11494328 23540356
## 5    5 8-May     0 18.5 342.3 82742929 35417089 17923760 11589061 22848737
## 6    6 8-Jun     0 21.3 453.7 82180013 36692291 15205229 11360771 22487441

Retrieving Custom Data Fromats (RDS, Stata, SPSS)

If a file is displayed on dataverse as a .tab file like the survey data by Alvarez et al. (2013), it is likely that Dataverse ingested the file to a plain-text, tab-delimited format.

argentina_tab <- get_dataframe_by_name(
  "alpl2013.tab",
  "10.7910/DVN/ARKOTI")

However, ingested files may not retain important dataset attributes. For example, Stata and SPSS datasets encode value labels on to numeric values. Factor variables in R dataframes encode levels, not only labels. A plain-text ingested file will discard such information. For example, the polling_place variable in this data is only given by numbers, although the original data labelled these numbers with informative values.

str(argentina_tab$polling_place)
##  num [1:1475] 31 31 31 31 31 31 31 31 31 31 ...

When ingesting, Dataverse retains a original version that retains these attributes but may not be readable in some platforms. The get_dataframe_* functions have an argument that can be set to original = TRUE. In this case we know that alpl2013.tab was originally a Stata dta file, so we can run:

argentina_dta <- get_dataframe_by_name(
  "alpl2013.tab",
  "10.7910/DVN/ARKOTI",
  original = TRUE,
  .f = haven::read_dta)

Now we see that labels are read in through haven’s labelled variables class:

str(argentina_dta$polling_place)
##  dbl+lbl [1:1475] 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 3...
##  @ label       : chr "polling_place"
##  @ format.stata: chr "%9.0g"
##  @ labels      : Named num [1:37] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:37] "E.E.T." "Escuela Juan Bautista Alberdi" "Escuela Juan Carlos Dávalos" "Escuela Bernardino de Rivadavia" ...

Users should pick .f and original based on their existing knowledge of the file. If the original file is a .sav SPSS file, .f can be haven::read_sav. If it is a .Rds file, use readRDS or readr::read_rds. In fact, because the raw data is read in as a binary, there is no limitation to the file types get_dataframe_* can read in, as far as the dataverse package is concerned.

There are two more ways to read in a dataframe other than get_dataframe_by_name(). get_dataframe_by_doi() takes in a file-specific DOI if Dataverse contains one such as https://doi.org/10.7910/DVN/ARKOTI/IJPVOI. This removes the necessity for users to set the dataset argument. get_dataframe_by_id() takes a numeric Dataverse identification number. This identifier is an internal number and is not prominently featured in the interface.

In addition to visual inspection, we can compare the UNF signatures for each dataset against what is reported by Dataverse to confirm that we received the correct files. (See the UNF package and unf function).

Retrieving Scripts and other files

If the file you want to retrieve is not data, you may want to use the more primitive function, get_file, which gets the file data as a raw binary file. See the help page examples of get_file() that use the base::writeBin() function for details on how to write and read these binary files instead.

code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
writeBin(code3, "chapter03.R")