Clustering time series using funtimes package

Srishti Vishwakarma

2022-02-22

1 Introduction

In this tutorial, two unsupervised clustering algorithms from the funtimes library are used to identify clusters of Australia’s sea level time series.

1.1 Loading libraries

First, load the essential libraries for the analysis:

library(funtimes)
library(ggplot2)
library(gridExtra)
library(readxl)
library(reshape2)

2 Data

The daily sea level data are available from 1993-2012 for 17 locations. The data are obtained from Maharaj, D’Urso, and Caido (2019) using the following link http://www.tsclustering.homepage.pt/index.php?p=3. Download Application7_3.zip folder, where the Aus Sea Levels 17.xlsx file contains the sea level records. Annual average is taken to convert the temporal resolution.

d_org <- readxl::read_xlsx("Aus_Sea_Levels_17.xlsx", skip = 1, n_max = 7300, col_types = rep("numeric", 20))
# yearly average
d <- data.frame(aggregate(d_org[, 4:20], list(d_org$Year), FUN = 'mean', na.rm = TRUE)[, -1], row.names = unique(d_org$Year))

2.1 Plotting time series

Below is the plot of annual time series of sea level for 17 locations:

dlong <- reshape2::melt(t(d))
names(dlong)[1:2] <- c("Location", "Year")
ggplot(dlong) + geom_line(aes(x = Year, y = value, color = Location), size = 1) +
  ylab('Sea level (m)') +
  theme_bw()

This plot demonstrates the variation in the sea levels across the locations. It can be seen that not all the time series are having a common trend since 1993. Grouping the locations with a common trend could benefit Australian government to assess and implement climate adaptation strategies for the impact of sea level rise on clustered locations.

3 Clustering time series based on trend synchronism

The first function from the package to test is the sync_cluster that groups the time series with the common linear trend. The window parameter w is set here for number of slides in each window. If the number of years are not enough in the time series, this parameter is required to be set.

set.seed(123)
Clus_sync <- sync_cluster(d ~ t, Window = 3, B = 100)
# [1] "Cluster labels:"
#  [1] 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0
# [1] "Number of single-element clusters (labeled with '0'): 4"
Clus_sync
# $cluster
#  [1] 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0
# 
# $elements
# $elements$`Time series that each formed a separate cluster`
# [1] "Broome"        "Cape.Fergusen" "Carnarvon"     "Wyndham"      
# 
# $elements$`1`
#  [1] "Booby.Island" "Brisbane"     "Burnie"       "Cocos.Island" "Darwin"      
#  [6] "Esperance"    "Freemantle"   "Hillarys"     "Portland"     "Spring.Bay"  
# [11] "Sydney"       "Thevenard"    "Townsville"  
# 
# 
# $estimate
# $estimate[[1]]
#              Estimate Std. Error   t value     Pr(>|t|)
# (Intercept) -1.230297  0.2652816 -4.637704 2.045787e-04
# t            2.343424  0.4429056  5.291023 4.966844e-05
# 
# 
# $pval
# $pval[[1]]
# [1] 0.06
# 
# 
# $statistic
# $statistic[[1]]
# Test statistic 
#      0.2646767 
# 
# 
# $ar_order
# $ar_order[[1]]
#          1  2  4 7  8 9 10 11 12 13 14 15 16
# ar.order 0 12 11 0 12 0  0  0  0  0  0 11  1
# 
# 
# $window_used
# $window_used[[1]]
#        1 2 4 7 8 9 10 11 12 13 14 15 16
# Window 3 3 3 3 3 3  3  3  3  3  3  3  3
# 
# 
# $all_considered_windows
# $all_considered_windows[[1]]
#  Window Statistic p-value Asympt. p-value
#       3 0.2646767    0.06       0.4608282
# 
# 
# $WAVK_obs
# $WAVK_obs[[1]]
#  [1]  0.212437377 -0.009172104 -0.011933680 -0.158881835 -0.017872773
#  [6] -0.017770762  0.199031778  0.060299520 -0.081576400 -0.100934598
# [11]  0.099766572 -0.007834631  0.099118234

Total 13 locations are clustered with a common linear trend, while the remaining four are not tied to any of the location with a common trend.

Below is the plot of clustered time series of sea level:

for(i in 0:max(Clus_sync$cluster)) {
  assign(paste('py', i, sep = ''),
         ggplot(melt(t(d[, Clus_sync$cluster == i]))) +
           geom_line(aes(x = Var2,y = value,color = Var1),size = 1) +
           ylab('Sea level (m)') + xlab('Year') +
           theme_bw() + ggtitle(paste('Cluster',i)) +
           theme(axis.text = element_text(size = 13), axis.title.x = element_text(size = 15),
                 axis.title.y = element_text(size = 15), legend.text = element_text(size = 10),
                 legend.title = element_blank(), legend.key.size = unit(0.3, "cm")))
}
grid.arrange(py0, py1)

Here, Cluster 0 indicates the locations which are not tied to the locations with a common linear trend, while Cluster 1 shows the time series of locations with a common linear trend.

4 Clustering time series using a spatiotemporal approach

The BICC function applies an unsupervised spatiotemporal clustering algorithm, TRUST, from Ciampi, Appice, and Malerba (2010). It utilizes both the CSlideCluster and CWindowCluster functions for clustering. There are few critical parameters to be considered before implementing the BICC function. The parameter p indicates the number of layers (i.e., number of observations in a time series) within each slide, while window w is the number consecutive slides. The size of slide and window are set manually. The slide s refers to the number of steps required to shift the window.

Clus_BICC <- BICC(as.matrix(d), p = 5, w = 4, s = 4)
Clus_BICC
# $delta.opt
# [1] 1.405918
# 
# $epsilon.opt
# [1] 0.25
# 
# $clusters
#          Booby.Island Brisbane Broome Burnie Cape.Fergusen Carnarvon
# Window_1            1        1      1      1             1         1
#          Cocos.Island Darwin Esperance Freemantle Hillarys Portland Spring.Bay
# Window_1            1      1         1          1        1        1          1
#          Sydney Thevenard Townsville Wyndham
# Window_1      1         1          1       1
# 
# $IC
#          [,1]     [,2]     [,3]     [,4]
# [1,] 244.1827 244.1827 244.1827 244.1827
# [2,] 244.1827 244.1827 244.1827 244.1827
# [3,] 244.1827 244.1827 244.1827 244.1827
# [4,] 244.1827 244.1827 244.1827 244.1827
# [5,] 244.1827 244.1827 244.1827 244.1827
# 
# $delta.all
# [1]  1.405918  8.435507 15.465096 22.494685 29.524274
# 
# $epsilon.all
# [1] 0.25 0.50 0.75 1.00

The algorithm detected only one cluster.

Citation

This vignette belongs to R package funtimes. If you wish to cite this page, please cite the package:

citation("funtimes")
# 
# To cite package 'funtimes' in publications use:
# 
#   Vyacheslav Lyubchich and Yulia R. Gel (2022). funtimes: Functions for
#   Time Series Analysis. R package version 8.2.
# 
# A BibTeX entry for LaTeX users is
# 
#   @Manual{,
#     title = {funtimes: Functions for Time Series Analysis},
#     author = {Vyacheslav Lyubchich and Yulia R. Gel},
#     year = {2022},
#     note = {R package version 8.2},
#   }

References

Ciampi, A., A. Appice, and D. Malerba. 2010. “Discovering Trend-Based Clusters in Spatially Distributed Data Streams.” In International Workshop of Mining Ubiquitous and Social Environments, 107–22. Barcelona, Spain.
Maharaj, E. A., P. D’Urso, and J. Caido. 2019. Time Series Clustering and Classification. 1st ed. Computer Science and Data Analysis. Australia: CRC Press. https://doi.org/10.1201/9780429058264.