Piggyback Data atop your GitHub Repository!

Carl Boettiger

2021-09-08

Why piggyback?

piggyback grew out of the needs of students both in my classroom and in my research group, who frequently need to work with data files somewhat larger than one can conveniently manage by committing directly to GitHub. As we frequently want to share and run code that depends on >50 MB data files on each of our own machines, on continuous integration (e.g. Travis), and on larger computational servers, data sharing quickly becomes a bottleneck.

GitHub allows repositories to attach files of up to 2 GB each to releases as a way to distribute large files associated with the project source code. There is no limit on the number of files or bandwidth to deliver them.

Installation

Install the latest release from CRAN using:

install.packages("piggyback")

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("ropensci/piggyback")

Authentication

No authentication is required to download data from public GitHub repositories using piggyback. Nevertheless, piggyback recommends setting a token when possible to avoid rate limits. To upload data to any repository, or to download data from private repositories, you will need to authenticate first.

To do so, add your GitHub token to an environment variable, e.g. in a .Renviron file in your home directory or project directory (any private place you won’t upload); see usethis::edit_r_environ(). For one-off use you can also set your token from the R console using:

Sys.setenv(GITHUB_TOKEN="xxxxxx")

But try to avoid putting Sys.setenv() in any R scripts. Remember, the goal here is to avoid writing your private token in any file that might be shared, even privately. For more help setting up a GitHub token for the first time, see usethis::browse_github_pat().
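As a quick sanity check before uploading, it can help to confirm that a token is visible to the R session without ever printing it. The following is a minimal sketch in base R; has_github_token() is a hypothetical helper written for this illustration, not a piggyback function, and it only checks the GITHUB_TOKEN variable mentioned above.

```r
# Hypothetical helper (not part of piggyback): report whether GITHUB_TOKEN
# is set to a non-empty value, without revealing the token itself.
has_github_token <- function() {
  nzchar(Sys.getenv("GITHUB_TOKEN"))
}

if (!has_github_token()) {
  message("GITHUB_TOKEN is not set; uploads and private downloads need it.")
}
```

Because the helper only returns TRUE or FALSE, it is safe to leave in a shared script, unlike a Sys.setenv() call containing the token.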

Downloading data

Download the latest version or a specific version of the data:

library(piggyback)
pb_download("iris2.tsv.gz", 
            repo = "cboettig/piggyback-tests",
            tag = "v0.0.1",
            dest = tempdir())

Note: Whenever you are working from a location inside a git repository corresponding to your GitHub repo, you can simply omit the repo argument and it will be detected automatically. Likewise, if you omit the release tag, pb_download() will simply pull data from the most recent release (latest). Third, you can omit tempdir() if you are using an RStudio Project (.Rproj file) in your repository, and then the download location will be relative to the Project root. tempdir() is used throughout the examples only to meet CRAN policies and is unlikely to be the choice you actually want here.

Lastly, simply omit the file name to download all assets connected with a given release.

pb_download(repo = "cboettig/piggyback-tests",
            tag = "v0.0.1",
            dest = tempdir())

These defaults mean that in most cases, it is sufficient to simply call pb_download() without additional arguments to pull in any data associated with a project on a GitHub repo that is too large to commit to git directly.

pb_download() will skip the download of any file that already exists locally if the timestamp on the local copy is more recent than the timestamp on the GitHub copy. pb_download() also includes arguments to control the timestamp behavior, progress bar, whether existing files should be overwritten, or if any particular files should not be downloaded. See function documentation for details.
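The timestamp rule described above can be pictured with a small base R sketch. This is only an illustration of the comparison, under the assumption that the remote timestamp is available as a POSIXct value; needs_download() is a hypothetical helper and does not reflect how piggyback implements the check internally.

```r
# Hypothetical helper (not part of piggyback): decide whether a file should
# be downloaded, given the time the remote copy was last updated.
needs_download <- function(local_path, remote_time) {
  # No local copy yet: always download.
  if (!file.exists(local_path)) {
    return(TRUE)
  }
  # Skip only when the local copy is more recent than the remote copy.
  !(file.mtime(local_path) > remote_time)
}

# Example: a freshly written local file compared with an older remote copy.
local_file <- tempfile(fileext = ".tsv")
writeLines("a\tb", local_file)
needs_download(local_file, Sys.time() - 3600)
```

In the example the local file is newer than the (pretend) remote timestamp, so the helper reports that no download is needed.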

Sometimes it is preferable to have a URL from which the data can be read in directly, rather than downloading the data to a local file. For example, such a URL can be embedded directly into another R script, avoiding any dependence on piggyback (provided the repository is already public). To get a list of URLs rather than actually downloading the files, use pb_download_url():

pb_download_url("data/mtcars.tsv.gz", 
                repo = "cboettig/piggyback-tests",  
                tag = "v0.0.1") 

Uploading data

If your GitHub repository doesn’t have any releases yet, piggyback will help you quickly create one. Create new releases to manage multiple versions of a given data file. While you can create releases as often as you like, making a new release is by no means necessary each time you upload a file. If maintaining old versions of the data is not useful, you can stick with a single release and upload all of your data there.

pb_new_release("cboettig/piggyback-tests", "v0.0.2")

Once we have at least one release available, we are ready to upload. By default, pb_upload will attach data to the latest release.

## We'll need some example data first.
## Pro tip: compress your tabular data to save space & speed upload/downloads
readr::write_tsv(mtcars, "mtcars.tsv.gz")

pb_upload("mtcars.tsv.gz", 
          repo = "cboettig/piggyback-tests", 
          tag = "v0.0.1")

Like pb_download(), pb_upload() will by default overwrite any file of the same name already attached to the release, unless the timestamp of the previously uploaded version is more recent. You can toggle these settings with overwrite=FALSE and use_timestamps=FALSE.

Additional convenience functions

List all files currently piggybacking on a given release. Omit the tag to see files on all releases.

pb_list(repo = "cboettig/piggyback-tests", 
        tag = "v0.0.1")

Delete a file from a release:

pb_delete(file = "mtcars.tsv.gz", 
          repo = "cboettig/piggyback-tests", 
          tag = "v0.0.1")

Note that this is irreversible unless you have a copy of the data elsewhere.

git-style tracking

piggyback can be used in a Git-LFS-like manner by tracking all files that match a particular pattern, typically a file extension such as *.tif or *.tar.gz frequently found on large binary data files associated with a project but too big to commit to git. Similarly, specific directories for data files can be tracked. The pb_track() function takes such patterns and stores them in a hidden config file, .pbattributes (just like .gitattributes in Git LFS, which you can also edit manually).

pb_track(c("*.tsv.gz", "*.tif", "*.zip"))
pb_track("data/*")

Adding a pattern with pb_track() will also automatically add that pattern to .gitignore, since these data files will be piggybacking on top of the repo rather than being version managed by git. You probably will want to check in the .pbattributes file to version control, just as you would a .gitattributes or .gitignore.
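To make the idea of glob-style tracking concrete, here is a minimal base R sketch of how patterns such as those passed to pb_track() above could be matched against relative file paths. It relies on utils::glob2rx() to turn globs into regular expressions; match_tracked() is a hypothetical helper for illustration, and piggyback's own matching may differ in its details.

```r
# Hypothetical helper (not part of piggyback): return the entries of `paths`
# that match at least one glob pattern in `patterns`.
match_tracked <- function(paths, patterns) {
  regexes <- utils::glob2rx(patterns)  # e.g. "*.tif" becomes "^.*\\.tif$"
  keep <- rep(FALSE, length(paths))
  for (rx in regexes) {
    keep <- keep | grepl(rx, paths)
  }
  paths[keep]
}

paths <- c("data/a.csv", "mtcars.tsv.gz", "R/script.R", "img/map.tif")
match_tracked(paths, c("*.tsv.gz", "*.tif", "data/*"))
# returns "data/a.csv" "mtcars.tsv.gz" "img/map.tif"
```

In a real project the candidate paths would typically come from something like list.files(recursive = TRUE) run at the project root.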

Once you have tracked certain file types, it is easy to push all such files up to GitHub by piping pb_track() %>% pb_upload(). pb_track() just returns file paths to all matching files. As usual, this can upload to a specific repository and tag or merely to the defaults.

library(magrittr)
pb_track() %>% pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")

Similarly, you can download all current data assets of the latest or specified release by using pb_download() with no arguments.

Caching

To reduce API calls to GitHub, piggyback caches most calls with a timeout of 1 second by default. This avoids repeating identical requests to update its internal record of the repository data (releases, assets, timestamps, etc.) during programmatic use. You can increase or decrease this delay by setting the piggyback_cache_duration environment variable to a value in seconds, e.g. Sys.setenv("piggyback_cache_duration"=10) for a longer delay or Sys.setenv("piggyback_cache_duration"=0) to disable caching.
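The effect of a cache timeout can be illustrated with a small base R sketch. This is not piggyback's implementation; cache_with_timeout() is a hypothetical helper, and reading the default timeout from piggyback_cache_duration is an assumption made here only to mirror the setting described above.

```r
# Hypothetical helper (not part of piggyback): wrap a zero-argument function
# so that calls made within `ttl` seconds reuse the stored result.
cache_with_timeout <- function(f,
    ttl = as.numeric(Sys.getenv("piggyback_cache_duration", "1"))) {
  value <- NULL
  stamp <- NULL
  function() {
    now <- Sys.time()
    expired <- is.null(stamp) ||
      as.numeric(difftime(now, stamp, units = "secs")) >= ttl
    if (expired) {
      value <<- f()   # refresh the stored result
      stamp <<- now   # remember when it was fetched
    }
    value
  }
}

# Example: the second call reuses the first result while the timeout lasts.
fetch <- cache_with_timeout(function() Sys.time(), ttl = 10)
first <- fetch()
second <- fetch()
identical(first, second)
```

With ttl = 0 the stored result is always treated as expired, so every call goes through to the wrapped function, which corresponds to disabling caching.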

Path names

GitHub assets attached to a release do not support file paths, and GitHub will convert most special characters (#, %, etc.) to . or throw an error (e.g. for file names containing $, @, /). To preserve path information when uploading data, piggyback uses relative paths in data file names (relative to the working directory, or for pb_push() and pb_pull(), relative to the project directory, see here::here()) and encodes the system path delimiter as .2f (%2f is the URL percent-encoding of a literal /, but % cannot be used in asset names). piggyback functions will always show and use the decoded file names, e.g. data/mtcars.csv, but you’ll see data.2fmtcars.csv if you look at the release attachment on GitHub.
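The encoding described above amounts to a simple string substitution. The following base R sketch shows the round trip for the example file name; encode_asset_name() and decode_asset_name() are hypothetical helpers written for this illustration (assuming / as the path delimiter), not functions exported by piggyback.

```r
# Hypothetical helpers (not part of piggyback): swap the path delimiter "/"
# for ".2f" and back again, treating both as literal strings.
encode_asset_name <- function(path) gsub("/", ".2f", path, fixed = TRUE)
decode_asset_name <- function(name) gsub(".2f", "/", name, fixed = TRUE)

encode_asset_name("data/mtcars.csv")    # "data.2fmtcars.csv"
decode_asset_name("data.2fmtcars.csv")  # "data/mtcars.csv"
```

One consequence worth noting in this sketch is that a file whose own name contains the literal text .2f would be decoded as if it contained a path delimiter, so such names are best avoided.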

A Note on GitHub Releases vs Data Archiving

piggyback is not intended as a data archiving solution. Importantly, bear in mind that there is nothing special about multiple “versions” in releases, as far as data assets uploaded by piggyback are concerned. The data files piggyback attaches to a release can be deleted or modified at any time. Creating a new release to store data assets is the functional equivalent of just creating new directories v0.1, v0.2 to store your data. (GitHub releases are always pinned to a particular git tag, so the code and git-managed contents associated with the repo are more immutable, but remember our data assets just piggyback on top of the repo.)

Permanent, published data should always be archived in a proper data repository with a DOI, such as zenodo.org. Zenodo can freely archive public research data files up to 50 GB in size, and data is strictly versioned (once released, a DOI always refers to the same version of the data; new releases are given new DOIs). piggyback is meant only to lower the friction of working with data during the research process, e.g. by making data accessible to collaborators or continuous integration systems, including for private repositories.

What will GitHub think of this?

GitHub documentation at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project:

Of course, it will be up to GitHub to decide if this use of release attachments is acceptable in the long term.