Introduction to dcmodify

Mark van der Loo and Edwin de Jonge

2021-09-24

A first statement

In the iris dataset, replace Sepal.Width with 4 value if it exceeds 4.

library(dcmodify)
library(magrittr)
iris %<>% modify_so( if(Sepal.Width > 4 ) Sepal.Width <- 4 )

Why this package

Data cleaning work flows or scripts typically contain a lot of ‘if this do that’ type of statements. Such statements are typically condensed expert knowledge. With this package, such ‘data modifying rules’ are taken out of the code and become instead parameters to the work flow. This allows you to maintain, document and reason about data modification rules separately from the flow of your programme.

This means you, the expert, can focus on the content and let R do the work.

Basic workflow

The workflow of dcmodify is designed to take two concerns of your hands. The first concern is how to implement the many ideas and rules that define how and when to modify data. The second concern is related to how to apply such rules to your data. We therefore introduce two nouns and one verb that govern the basic workflow.

Here’s an example using the retailers data set from the validate package.

data("retailers", package="validate")
head(retailers[-(1:2)],3)
##   staff turnover other.rev total.rev staff.costs total.costs profit vat
## 1    75       NA        NA      1130          NA       18915  20045  NA
## 2     9     1607        NA      1607         131        1544     63  NA
## 3    NA     6886       -33      6919         324        6493    426  NA

First we define a set of modifying rules, using modifier.

library(dcmodify)
m <- modifier(
  if (other.rev < 0) other.rev <- -1 * other.rev
  , if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs)
)

Next, the rules can be applied to our data.

ret1 <- modify(retailers,m)

Alternatively, if you’re a fan of the magrittr, package you can do this

library(magrittr)
ret2 <- retailers %>% modifier(m)

or even

retailers %<>% modify_so(
  if ( other.rev < 0) other.rev <- -1 * other.rev
  , if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs)
)

Here, the %<>% operator makes sure that the original dataset gets overwritten, and modify_so is a shortcut function for defining modificaton rules in-line.

Handling missing values

The rules you define in a modifier are executed on records where the conditions yields TRUE. In R this poses the problem on what to do when in a record the condition evaluates to NA. For example, the condition

other.rev < 0

in the first rule of m above evaluates to NA in the first record of the retailers dataset. Such cases are handled by treating it as if the condition evaluated to FALSE.

Exporting and importing rules from file

Modifier rules can also be defined and stored outside of the R script through the use of YAML files. Defining a YAML file can be done by hand, or by exporting an existing modifier object via export_yaml or as_yaml. Exporting the modifier defined in the Basic workflow section would look as follows:

export_yaml(m, "myrules.yaml")

This code will create a YAML file with the following content:

rules:
- expr: if (other.rev < 0) other.rev <- -1 * other.rev
  name: M1
  label: ''
  description: ''
  created: 2021-07-29 16:57:00
  origin: command-line
  meta: []
- expr: if (is.na(staff.costs)) staff.costs <- mean(staff.costs)
  name: M2
  label: ''
  description: ''
  created: 2021-07-29 16:57:00
  origin: command-line
  meta: []

Out of all these keys only rules: and expr: are required, all others are optional.

Once a YAML file is created, modifier can read the modification rules from the file and store it as a modifier object. For this the .file argument is used:

m <- modifier(.file = "myrules.yaml")

Using separate files for the storage of rules has the advantage that the same set of rules can be easily shared across many different scripts.

Performance, and a glimpse under the hood.

You, the user can assume that the rules are evaluated record-by-record. In reality, the package is smart enough to analyse the rules a little bit and to make sure they can be evaluated in a vectorized manner. This way explicit (and slow) R-loops are avoided as much as possible.

In short, when you call modify, or modify_so, the following steps are performed.

  1. The rules are transformed to statements that can be executed in a vectorized manner by R.
  2. If any macros present, they are inserted into the statements
  3. For each assignment, the conditions under which they should be executed are collected.
  4. The conditions are evaluated and assignments are executed on a selection of the data.

Difference with dplyr::mutate

The functionality of this package resembles dplyr::mutate, since it also allows one to specify data mutations on data frames (or other tabular data objects). The dplyr package is especially useful for interactive use and also for use in programming through ‘underscored’ functions such as mutate_.

The dcmodify package has been developed with a production street in mind where similar data sets are processed frequently. By taking the modifying rules out of the software, R programmers can build an application that allows users that are less knowledgeable about programming to specify their modification rules.

Logging changes

It can be interesting to study the effect of a certain set of data modifying rules. The lumberjack package is capable of tracking changes in data.

To start logging data you need to replace the magrittr pipe (%>%) with the lumberjack operator %>>% and insert some logging commands into the pipeline.

library(lumberjack)
# add primary key so cellwise changes can be traced
women$id <- letters[1:15]

out <- women %>>%
  start_log( cellwise$new(key="id") ) %>>%
  modify_so( if (height < mean(height)) height <- mean(height) ) %>>%
  dump_log()
## Dumped a log at cellwise.csv
# The log is written to file.
read.csv("cellwise.csv") %>>% head()
##   step                     time srcref
## 1    1 2021-09-24 11:29:10 CEST     NA
## 2    1 2021-09-24 11:29:10 CEST     NA
## 3    1 2021-09-24 11:29:10 CEST     NA
## 4    1 2021-09-24 11:29:10 CEST     NA
## 5    1 2021-09-24 11:29:10 CEST     NA
## 6    1 2021-09-24 11:29:10 CEST     NA
##                                                     expression key variable old
## 1 modify_so(if (height < mean(height)) height <- mean(height))   a   height  58
## 2 modify_so(if (height < mean(height)) height <- mean(height))   b   height  59
## 3 modify_so(if (height < mean(height)) height <- mean(height))   c   height  60
## 4 modify_so(if (height < mean(height)) height <- mean(height))   e   height  62
## 5 modify_so(if (height < mean(height)) height <- mean(height))   f   height  63
## 6 modify_so(if (height < mean(height)) height <- mean(height))   g   height  64
##   new
## 1  61
## 2  61
## 3  61
## 4  61
## 5  61
## 6  61

Current limitations

Conditional statements including else are not supported yet. Rules containing if() else are ignored with a warning.