LPS package is online

This blog entry is outdated. Please check the R package on CRAN. Here is the link to the package page. The manual provides all the necessary information about running LPS on univariate time series.

After the submission of our paper "Time series similarity based on a pattern-based representation" (supporting page), we made the R package (LPS) available online. It still requires a significant amount of work in terms of documentation, which is why I cannot submit it to CRAN in its current form. Hopefully, I will finish it soon.

We illustrated the performance of the similarity measure on classification problems. The R code for classification is provided in the Files section. This example uses the GunPoint dataset from the UCR Time Series Database. Here, I will go over the steps and explain how to run LPS for classification.

1. Calling the package and setting the parameters: The default settings of the parameters (the values used in the paper) are set for the functions implemented in the R package. The package functions are loaded with the require() function. The segment length factors to be evaluated by cross-validation are defined as an array named seglenfactor. The number of trees for learning patterns (denoted as J in the paper) is set to 150 (as in the paper). The paths of the files should also be provided; here, the files are in my working directory.

require(LPS)

# parameters (L and J)
seglenfactor=c(0.25,0.5,0.75)
noftree=150

trainfile='GunPoint_TRAIN'
testfile='GunPoint_TEST'

2. Organization of the training and test files: This consists of three main tasks: the files are read, the class information is extracted, and the time series are standardized to zero mean and a standard deviation of one to make the results comparable to the DTW results provided by the UCR Time Series Database.

#read train data
traindata_labeled <- as.matrix(read.table(trainfile,comment.char = ""))
class_train=traindata_labeled[,1] 
noftrain_labeled=nrow(traindata_labeled)

#standardize (if needed)
traindata_labeled=t(apply(traindata_labeled[,2:ncol(traindata_labeled)], 1, function(x) (x-mean(x))/sd(x)))

#read test data
testdata <- as.matrix(read.table(testfile,comment.char = ""))
nof_test=nrow(testdata)
class_test=testdata[,1]

#standardize (if needed)
testdata=t(apply(testdata[,2:ncol(testdata)], 1, function(x) (x-mean(x))/sd(x)))
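As a quick standalone check (independent of the LPS package), the z-normalization applied above can be verified on a toy series:

```r
# z-normalize a single series: subtract the mean, divide by the standard deviation
znorm <- function(x) (x - mean(x)) / sd(x)

toy <- c(1, 3, 5, 7, 9)
z <- znorm(toy)
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```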

3. Training: This consists of tuning the parameters and learning the patterns with the tree-based ensemble. The tunelearnPattern() function is implemented for parameter tuning. After tuning, the best segment length factor and the best depth level are used to learn patterns with learnPattern(). The arguments of learnPattern() are almost the same as those of tunelearnPattern(); they are described below:

tunelearnPattern <- function(x, y, unlabeledx=NULL, nfolds=5, segmentlevels=c(0.25,0.5,0.75), 
mindepth=4, maxdepth=8, depthstep=2, ntreeTry=25, target.diff=TRUE, segment.diff=TRUE,  ...) 
x: the training data matrix (each row is a time series)
y: the class labels of the training data
unlabeledx: LPS may benefit from unlabeled data; this argument is reserved for future use
nfolds: number of folds for cross-validation (default: 5)
segmentlevels: segment length factors to be evaluated (default: c(0.25,0.5,0.75))
(mindepth, maxdepth, depthstep): determine the depth levels to be evaluated (the default setting evaluates depths 4, 6 and 8)
ntreeTry: number of trees used for pattern learning in each fold (default: 25)
target.diff: TRUE if the target can be a difference series, FALSE otherwise (default: TRUE)
segment.diff: TRUE if the predictor segment can be a difference series, FALSE otherwise (default: TRUE)
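For clarity, the depth grid implied by the (mindepth, maxdepth, depthstep) triple is simply the arithmetic sequence generated by seq(); with the defaults:

```r
# depth levels evaluated during cross-validation with the default settings
mindepth <- 4; maxdepth <- 8; depthstep <- 2
depths <- seq(mindepth, maxdepth, by = depthstep)
depths  # 4 6 8
```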
 
The tuning and pattern learning scripts are given below:
tune=tunelearnPattern(traindata_labeled, class_train, segmentlevels=seglenfactor, mindepth=4, maxdepth=8, depthstep=2, ntreeTry=25, target.diff=T,segment.diff=T)

# learn patterns
ensemble=learnPattern(traindata_labeled, segment_length_factor=tune$best.seg, target.diff=T, segment.diff=T, ntree=noftree, maxdepth=tune$best.depth, replace=FALSE)

The parameter "replace" deserves a brief explanation. If replace is set to TRUE, the patterns are learned with the bagging idea of random forests: each tree selects a bootstrap sample of the time series. Our approach, on the other hand, benefits from the maximum possible number of training instances to find better representations, hence this parameter is set to FALSE. We kept the parameter to allow control over the number of training instances per tree, which can significantly reduce the training time.
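The difference is easy to illustrate with base R sampling (a standalone sketch of the idea, not package internals): with replace=TRUE each tree would see a bootstrap sample that typically omits some series, whereas sampling without replacement at full sample size keeps every training series available.

```r
set.seed(1)
n <- 10  # number of training series

# replace=TRUE: bootstrap sample, duplicates allowed, some series left out
boot_idx <- sample(n, n, replace = TRUE)

# replace=FALSE (at full sample size): a permutation, every series appears once
full_idx <- sample(n, n, replace = FALSE)
length(unique(full_idx))  # 10
```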

4. Testing: At this stage, the time series are represented by the patterns learned by each tree, and the similarity is aggregated over the trees.

sim=matrix(0,nof_test,noftrain_labeled)
noftree=ensemble$forest$ntree
for(t in 1:noftree){
     representations=representTS(ensemble, traindata_labeled, testdata, which.tree=t, max_depth=tune$best.depth)
     sim=sim+computeSimilarity(representations$test, representations$train)
}

id=apply(sim,1,which.min)
predicted=class_train[id]
error_rate=1-sum(class_test==predicted)/nof_test
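The nearest-neighbor step above can be checked on a small hypothetical similarity matrix (toy values, unrelated to an actual LPS run); note that a smaller value means more similar here, which is why which.min is used:

```r
# toy example: 3 test series vs 4 training series (smaller value = more similar)
sim <- matrix(c(0.2, 0.9, 0.5, 0.8,
                0.7, 0.1, 0.6, 0.9,
                0.9, 0.8, 0.7, 0.3), nrow = 3, byrow = TRUE)
class_train <- c(1, 2, 1, 2)
class_test  <- c(1, 2, 1)

id <- apply(sim, 1, which.min)  # nearest training series per test series
predicted <- class_train[id]    # 1 2 2
error_rate <- 1 - sum(class_test == predicted) / nrow(sim)
error_rate  # 1/3
```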

A SAMPLE RUN RESULT

A screenshot of a sample run of LPS on the GunPoint dataset for 10 replications is provided below (Ubuntu 12.10 system with 8 GB RAM, dual-core i7-3620M 2.7 GHz CPU):

Learned Pattern Similarity (GunPoint)

Copyright © 2014 mustafa gokce baydogan
