The R package fdcov (Functional Data Covariance) contains a collection of tools for performing statistical inference on functional data specifically through an analysis of the covariance structure of the data. It includes two methods for performing a k-sample test for equality of covariance in ksample.perm and ksample.com and two methods for performing a 2-sample test ksample.gauss and ksample.vstab. For supervised and unsupervised learning, it contains a method to classify functional data with respect to each category’s covariance operator in classif.com, and it contains a method to cluster functional data, cluster.com, again based on the covariance structure of the data.

The current version of this package assumes that all functional data is sampled on the same grid at the same intervals. Future updates are planned to allow for the below methods to interface with the fda package and its functional basis representations of the data.

1 k-sample test for the equality of covariance operators

There are two methods for performing a k-sample test for equality of covariance operators.

1.1 Concentration Inequality test

ksample.com uses concentration inequalities to test for equality of covariance operator in the non-asymptotic setting from Kashlak et al. (2017). We use a subset of the data contained in the phoneme dataset from the R package fds to illustrate how to use this function.

# Load fdcov package
library(fdcov)
# Load in phoneme data
library(fds)
# Setup data arrays
dat1 = rbind( t(aa$y)[1:20,], t(sh$y)[1:20,] );
dat2 = rbind( t(aa$y)[1:20,], t(ao$y)[1:20,] );
dat3 = rbind( dat1, t(ao$y)[1:20,] );
# Setup group labels
grp1 = gl(2,20);
grp2 = gl(2,20);
grp3 = gl(3,20);

This function takes as arguments a data matrix with one entry per row. The second argument contains the labels identifying which entries in the data matrix belong to which group.

# Compare two disimilar phonemes (should return TRUE)
ksample.com(dat1,grp1);

## [1] TRUE

# Compare two similar phonemes (should return FALSE)
ksample.com(dat2,grp2);

## [1] FALSE

# Compare three phonemes (should return TRUE)
ksample.com(dat3,grp3);

## [1] TRUE

A boolean variable is returned indicating whether or not the test believes the covariance operators differ significantly. This test is fast to compute, but inherently conservative. As a result, the two scale arguments can be used to tweak the test as detailed in Kashlak et al. (2017). The default test size is alpha = 0.05. This can also be modified as an argument to the above functions.

1.2 Permutation test

ksample.perm can be used to perform the metric-based permutation test for the equality of covariance operators of Cabassi et al. (2017). Again, we use a subset of the data contained in the phoneme dataset from the R package fds to illustrate how to use this function.

## Create data set
# Load data
data(aa); data(ao); data(dcl); data(iy); data(sh) 
# Select 20 observations from each dataset
dat = cbind(aa$y[,1:20],ao$y[,1:20],dcl$y[,1:20],iy$y[,1:20],sh$y[,1:20]) 
# Input matrix must be of size N (number of observations) X P (number of time points)
dat = t(dat)
# Define cluster labels
grp = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20))

The function takes as input the dataset dat and the vector of group labels grp.

## Test the equality of the covariance operators
p = ksample.perm(dat, grp, part = TRUE)

The user can also choose the number of iterations of the permutation test iter (the default is 1000), the distance between covariance operators dist (the default is sq, the square root distance) and the combining function to find the global p-value comb (the default is tipp, the Tippett combining function). If the data have already been centred around the mean, we suggest to set cent = TRUE. Moreover, if part=FALSE, this function returns only the global p-value. If part=TRUE, it returns a vector that contains both the global p-value (in p$global) and a list of partial p-values, one for each pairwise comparison (in p$partial).

p$global # global p-value

## [1] 0

p$partial # partial p-values

##         dist p_value signif
## 1-2 30.00486   0.000    ***
## 1-3 28.78225   0.000    ***
## 1-4 30.23026   0.075      .
## 1-5 30.05272   0.000    ***
## 2-3 26.96973   0.000    ***
## 2-4 29.17826   0.000    ***
## 2-5 27.55809   0.000    ***
## 3-4 28.35040   0.000    ***
## 3-5 26.81700   0.000    ***
## 4-5 29.97638   0.000    ***

The values of the observed distances between each pair of covariance operators are also reported in the table. The user can also choose to adjust the p-values with the step-down method of Westfall and Young (1993) by setting adj = TRUE in ksample.perm. For more information about these options please see and Cabassi et al. (2017).

The partial p-values can be plotted using the perm.plot function, that takes as input the list p, output by perm.plot, the number of groups and a vector of strings that indicate the class labels (optional):

perm.plot(p, 5, lab=c('aa','ao','dcl','iy','sh')) # visualise partial p-values

It is also possible to automatically save the figure by setting save = TRUE. The figure will be saved in .eps format. The name of the figure can be customised using the parameter name.

1.3 Two sample tests for Gaussian data

ksample.gauss and ksample.vstab are two methods detailed in Panaretos et al. (2010) that compare two samples of functional data for testing for equality of covariance operators. In this case, the data is assumed to be Gaussian. Considering the same phoneme data,

# Load in phoneme data
library(fds)
# Set up test data
dat1 = t(aa$y)[1:20,];
dat2 = t(sh$y)[1:20,];
dat3 = t(aa$y)[21:40,];

we can compare two sets of 20 observations corresponding to different phonemes and two sets of 20 observations from the same population of phonemes.

# Compare two disimilar phonemes
# Resulting in a small p-value
ksample.gauss(dat1,dat2,K=5);

## [1] 0.0002660378

ksample.vstab(dat1,dat2,K=5);

## [1] 1.1825e-08

# Compare two sets of the same phonemes 
# Resulting in a large p-value
ksample.gauss(dat1,dat3,K=5);

## [1] 0.5825128

ksample.vstab(dat1,dat3,K=5);

## [1] 0.4979127

In these examples, K=5 eigen-functions were used to reduce the data to a finite dimensional space. Then, a test statistic is constructed with asymptotic chi-squared distribution with K(K+1)/2 degrees of freedom. From there, the p-values are computed.

2 Classification

classif.com trains a covariance operator based functional data classifier that makes use of concentration inequalities in the same way as the above ksample.com function. predict.classif.com expands upon the generic predict function for the classif.com class. It uses the previously trained classifier to classify new observations.

library(fds);
 # Setup training data
dat1 = rbind( 
  t(aa$y[,1:100]), t(ao$y[,1:100]), t(dcl$y[,1:100]), 
  t(iy$y[,1:100]), t(sh$y[,1:100]) 
);
 # Setup testing data
dat2 = rbind( 
  t(aa$y[,101:400]), t(ao$y[,101:400]), t(dcl$y[,101:400]), 
  t(iy$y[,101:400]), t(sh$y[,101:400]) 
);

datgrp = gl(5,100);

Once the data has been read in, the classifier can be trained.

clCom = classif.com( datgrp, dat1 );

Subsequently, the class associated with new observations can be predicted based on the trained classifier.

grp = predict( clCom, dat2, LOADING=FALSE );
acc = c(
  sum( grp[1:300]==1 ), sum( grp[301:600]==2 ), sum( grp[601:900]==3 ), 
  sum( grp[901:1200]==4 ), sum( grp[1201:1500]==5 )
)/300;
print(rbind(gl(5,1),signif(acc,3)));

##       [,1]  [,2] [,3]  [,4] [,5]
## [1,] 1.000 2.000 3.00 4.000 5.00
## [2,] 0.783 0.797 0.96 0.993 0.99

3 Clustering

cluster.com clusters sets of functional data via their covariance operators making use of an EM style algorithm with concentration inequalities. This method similarly is based on the same concentration paradigm as ksample.com and classif.com.

# Setup data to be clustered
dat  = rbind( t(aa$y[,1:20]),t(iy$y[,1:20]),t(sh$y[,1:20]) );

Given the unlabelled phoneme data, the following method will assign categories to each entry given the preselected number of categories in the argument grpCnt.

# Cluster data into three groups
clst = cluster.com(dat,grpCnt=3,PRINTLK = FALSE);
matrix(clst,3,20,byrow=TRUE);

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    1    1    1    1    1    1    1    1    1     1     1     1     1
## [2,]    2    2    2    2    2    2    2    2    2     3     2     2     2
## [3,]    3    3    3    3    3    3    3    3    3     3     3     3     3
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]     1     1     1     1     1     1     1
## [2,]     2     2     2     3     2     2     3
## [3,]     3     3     3     3     3     3     3

In this next example, groups of curves are clustered as units. That is, we begin with a sample of 120 observations in the variable dat. We next assign labels grouping each sequential set of four obervations as a single unit. In this way, the algorithm clusters based on 30 rank 4 empirical covariance operators instead of on 120 individual curves.

# cluster groups of curves
dat  = rbind( t(aa$y[,1:40]),t(iy$y[,1:40]),t(sh$y[,1:40]) );
lab  = gl(30,4);
# Cluster data into three groups
clst = cluster.com(dat,labl=lab,grpCnt=3,PRINTLK = FALSE);
matrix(clst,3,10,byrow=TRUE);

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    2    2    2    2    2    2    2    2     2
## [2,]    1    1    1    1    1    1    1    1    1     1
## [3,]    3    3    3    3    3    3    3    3    3     3

R package fdcov

Adam B. Kashlak, Alessandra Cabassi

2017-12-21