Updated TSPD codes are online

After submitting our paper "Supervised Time Series Pattern Discovery through Local Importance" (TSPD) (supporting page), we made the code available online. The functions used in TSPD are implemented as part of my recent R package, LPS. The source code for LPS is available here.

We illustrated the performance of TSPD on classification problems. The R code for classification is provided in the Files section. The example uses the GunPoint dataset from the UCR Time Series Database. Here, I go over the steps and explain how to run TSPD for classification.

1. Calling the packages and setting the parameters: Assuming that you have installed the LPS and randomForest packages, we load them with the require function. We set the parameters as described in the paper and use the same setting for all datasets. The comments (after #) describe the correspondence between the variables and the parameters of TSPD.

require(LPS)
require(randomForest)

#Parameters: Corresponding parameter notation in the paper is provided
nrep=10		# number of replications in TSPD
treePerIter=50	# J=J_I=J_P number of trees
intmaxfrac=0.05	# I(max) maximum interval length as a fraction of TS length
maxshapefrac=0.25  # used to set L based on a fraction of TS length
kfrac=c(2,1,0.5,0.25)  # K (number of patterns) is set based on certain levels of number of training data (N)
ksteps=c(0,100,500,1000,10000)	# N levels for setting K (i.e. if N<100 K=kfrac[1]*N which is K=2N)
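
To make the rule for K concrete, the short sketch below evaluates it for a few hypothetical training set sizes. The helper name patternCount and the values of N are assumptions introduced only for illustration; the expression inside the helper is the one the script uses later to set nofpattern.

```r
# Same settings as in the parameter block above
kfrac  <- c(2, 1, 0.5, 0.25)
ksteps <- c(0, 100, 500, 1000, 10000)

# Hypothetical helper (not part of LPS): number of patterns K for N training series
patternCount <- function(N) floor(kfrac[findInterval(N, ksteps)] * N)

# A few hypothetical training set sizes
N <- c(50, 100, 300, 500, 2000)
K <- sapply(N, patternCount)
print(data.frame(N = N, K = K))
```

With these settings, a training set of 10000 or more series falls past the last entry of kfrac, so the expression would return NA; a larger dataset would need an extra level.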

2. Organizing the training and test files and setting the parameters that depend on the training data: This consists of three main tasks. First, the files are read and the class information is extracted. Second, each time series is standardized to zero mean and unit standard deviation so that the results are comparable to the DTW results provided by the UCR Time Series Database. Third, the number of patterns and the maximum possible interval length are set based on the number of training instances and the time series length.

#read training data and characteristics
traindata=as.matrix(read.table("GunPoint_TRAIN"))
trainclass=traindata[,1]
noftrain=nrow(traindata)
traindata=t(apply(traindata[,2:ncol(traindata)], 1, function(x) (x-mean(x))/sd(x)))
nofclass=length(unique(trainclass))
lenseries=ncol(traindata)

#read test data and characteristics
testdata=as.matrix(read.table("GunPoint_TEST"))
noftest=nrow(testdata)
testclass=testdata[,1]
testdata=t(apply(testdata[,2:ncol(testdata)], 1, function(x) (x-mean(x))/sd(x)))
nofpattern=floor(kfrac[findInterval(noftrain,ksteps)]*noftrain) # setting K based on N
intmaxL=floor(lenseries*intmaxfrac) # maximum interval length for feature generation
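
As a quick sanity check of the standardization step, the sketch below applies the same row-wise expression to a small random matrix laid out like a UCR file (class label in the first column). The toy data and the variable names toy, toyclass and toyseries are assumptions made for illustration; they are not part of the GunPoint files.

```r
set.seed(1)

# Toy stand-in for a UCR file: first column is the class label, the rest is the series
toy <- cbind(c(1, 2, 1), matrix(rnorm(3 * 20, mean = 5, sd = 3), nrow = 3))

toyclass  <- toy[, 1]
# Same row-wise standardization as in the script above
toyseries <- t(apply(toy[, 2:ncol(toy)], 1, function(x) (x - mean(x)) / sd(x)))

# Each row should now have a mean of (numerically) zero and a standard deviation of one
print(round(rowMeans(toyseries), 10))
print(apply(toyseries, 1, sd))
```

The transpose is needed because apply over rows returns one column per input row.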

3. Training: This consists of training RFint and RFpattern for nrep replications. The code for one replication is given below. We select a random interval length between 5 and I(max) time units and train RFint on the interval representation. We then sample patterns based on the local importance from RFint and compute the best matching distances of the time series to the patterns. Finally, we train RFpattern on this representation.
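
To see which interval lengths the random draw can produce, the sketch below repeats the draw many times and then derives the slide length and maxInt for one value. The series length of 150 and the fixed intlen of 6 are assumptions chosen for illustration; the expressions themselves are the ones used in the training code.

```r
set.seed(42)

lenseries    <- 150    # assumed series length, for illustration only
intmaxfrac   <- 0.05
maxshapefrac <- 0.25
intmaxL <- floor(lenseries * intmaxfrac)   # 7 under these assumptions

# Repeat the random draw to see which interval lengths can occur
draws <- replicate(1000, max(5, floor(runif(1) * intmaxL) + 1))
print(table(draws))

# Derived quantities for one assumed interval length
intlen   <- 6
slidelen <- floor(intlen / 2)
maxInt   <- floor((lenseries * maxshapefrac) / intlen) + 1
print(c(intlen = intlen, slidelen = slidelen, maxInt = maxInt))
```

Because every draw below 5 is raised to 5, the smallest length is selected more often than the others under these assumptions.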

#initialize the matrices for storing predictions and the data structure for storing pattern information over replications
allvotes=matrix(0,noftest, nofclass)
allvotesOOB=matrix(0,noftrain, nofclass)
shapeletInfo=list(select=matrix(0,nofpattern,nrep),level=matrix(0,nofpattern,nrep)) 
#single replication of training TSPD (n is the index of the current replication, from 1 to nrep)
intlen=max(5,floor(runif(1)*intmaxL)+1) # select random interval length (w) between 5 and I(max)-> intmaxL
slidelen=floor(intlen/2) # set w=d/2 as described in the paper
maxInt=floor((lenseries*maxshapefrac)/(intlen))+1 # set L from maxshapefrac (see the parameters above)

#train RFint
train=intervalFeatures(traindata,intlen,slidelen)
RFint <- randomForest(train$features,factor(trainclass),ntree=treePerIter,localImp=TRUE)
localimp=RFint$localImp

#train RFpattern
shapelet=shapeletSimilarity(traindata,localimp,train,maxInt,nshapelet=nofpattern)
RFpattern=randomForest(shapelet$similarity,factor(trainclass),ntree=treePerIter)
allvotesOOB=allvotesOOB+predict(RFpattern,type='vote')

shapeletInfo$select[,n]=shapelet$sel
shapeletInfo$level[,n]=shapelet$lev
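
For readers unfamiliar with the term, a best matching distance can be sketched generically as the smallest distance between a pattern and any same-length subsequence of a series. The helper bestMatchDistance below is hypothetical and uses plain Euclidean distance as an assumption; it is not the LPS implementation behind shapeletSimilarity, which may differ in its distance and in how patterns are represented.

```r
# Hypothetical helper (not from LPS): smallest Euclidean distance between a
# pattern and any same-length subsequence of a series
bestMatchDistance <- function(series, pattern) {
  m <- length(pattern)
  starts <- seq_len(length(series) - m + 1)
  d <- sapply(starts, function(s) sqrt(sum((series[s:(s + m - 1)] - pattern)^2)))
  list(distance = min(d), position = which.min(d))
}

# Toy series that contains the pattern exactly, starting at position 3
series  <- c(0, 0, 1, 3, 1, 0, 0, 2, 0)
pattern <- c(1, 3, 1)
res <- bestMatchDistance(series, pattern)
print(res)
```

Computing one such distance per pattern for every series yields a matrix with one row per series and one column per pattern, which is the kind of representation a classifier can be trained on.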

4. Testing: Testing requires computing the best matching distances of the test time series to the patterns and classifying them with RFpattern. The voting results are aggregated in the allvotes matrix (of dimensions noftest x nofclass). The largest vote determines the class of each time series. The code for one replication of TSPD testing is provided below.

#single replication of testing TSPD
test=shapeletSimilarityTest(testdata, traindata, shapelet$importanceOrder, shapeletInfo, train, n)
prediction=predictShapelet(RFpattern, test$similarity, whichTrees=c(1,treePerIter))
allvotes=allvotes+prediction$vote
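
Once the votes are aggregated over all replications, the final step described above (the largest vote determines the class) can be sketched as follows. The vote matrix, the true labels and the assumption that the columns follow the order of levels(factor(trainclass)) are illustrative; they are not output from an actual TSPD run.

```r
# Assumed column order: levels(factor(trainclass))
classlabels <- c("1", "2")

# Toy vote matrix: 4 test series, 2 classes, summed over replications
allvotes <- matrix(c(7.2, 2.8,
                     1.5, 8.5,
                     5.0, 5.0,
                     3.1, 6.9),
                   ncol = 2, byrow = TRUE, dimnames = list(NULL, classlabels))
testclass <- c(1, 2, 2, 2)   # toy true labels

# Largest vote wins; which.max breaks ties in favor of the first class
predicted <- classlabels[apply(allvotes, 1, which.max)]
errorrate <- mean(predicted != as.character(testclass))
print(predicted)
print(errorrate)
```

The same computation applied to allvotesOOB and trainclass would give an out-of-bag estimate of the training error.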

A SAMPLE RUN RESULT

A screenshot of a sample run of TSPD on the GunPoint dataset for 10 replications is provided below (Ubuntu 12.10 system with 8 GB RAM and a dual-core i7-3620M 2.7 GHz CPU):

Sample patterns found to be important by RFpattern
This output can be compared to the results from other shapelet studies. A Google search for 'Gun-point shapelet' should return some relevant links. There is a good summary of the datasets and their descriptions on the jmotif Google Code homepage. The discovered patterns match the class descriptions.

Copyright © 2014 mustafa gokce baydogan
