Multivariate time series classification codes and data sets are online

Saturday, 03 November 2012 13:00

The code and data sets for our paper, "Multivariate Time Series Classification with Learned Discretization", are online. Please find the details by clicking the link.

S-MTS (Symbolic Multivariate Time Series) discretizes the observation space in a supervised manner to obtain a symbolic representation for classification. It is mostly implemented in R (using the randomForest package) and C (the time-consuming for loops are in C). There is no explicit feature extraction; the features are learned as part of the symbolic representation.

It can handle nominal (categorical) time series and missing values. It is multiclass (it does not require training multiple models, as Support Vector Machines (SVM) do). It scales well with the number of features (variables) and the number of time series.

A tree-based ensemble (random forest) is used to learn the symbols. Two parameters are important for generating the symbolic representation: the alphabet size and the number of trees. Since each tree is trained on a random subsample of the instances and features, the ensemble represents different views of the same time series (this has some connection to scale-space theory).

The code for S-MTS is available at http://www.mustafabaydogan.com/files/viewcategory/14-multivariate-time-series-classification.html.

Please let me know if you have any questions by contacting me through the contact link in the menu above.


The presentation of TSPD is uploaded

Wednesday, 17 October 2012 14:16

The presentation of TSPD at the INFORMS'12 conference in Phoenix has been uploaded. You can find it at http://www.mustafabaydogan.com/files/viewcategory/8-presentations.html

Please let me know if you have any questions!


TSPD codes are now available

Monday, 08 October 2012 10:50

The second study of my dissertation, Supervised Time Series Pattern Discovery through Local Importance (TS-PD), has been submitted to Knowledge and Information Systems. 

The codes and details are available on Supervised Time Series Pattern Discovery through Local Importance (TSPD).

Please contact me if you have any questions.


Our results are online

Tuesday, 02 October 2012 22:33

A webpage summarizing the error rates of the time series classifiers proposed in my dissertation has been created.

For now, only TSBF results (as well as the competitors') are reported. TS-PD and S-MTS results are coming soon!

Click here.


Comparing classifiers

Wednesday, 25 July 2012 17:14

Finding data sets to compare classifiers has turned into a problem recently. Thanks to the UCI repository for helping to a certain extent. But what if each paper uses its own experimentation strategy, or the authors do not provide their code to enable a fair comparison of the classifiers? So here are two things:

1) If authors share their code, there is no problem at all. Still, parameter setting can be a headache.

2) Then the best thing to do is to fix the experimentation strategy to evaluate the performance of a classifier. Salzberg's (1997) [a] proposal is a good one in that sense:


       - First divide the data set into k subsets for cross validation.

       - We then run a cross-validation as follows.

           (A) For each of the k subsets of the data set D, create a training set T = D - k.
           (B) Divide each training set into two smaller subsets, T1 and T2. T1 will be used for training, and T2 for tuning. The parameters of the algorithm are tuned based on the error rates on T2. This way, the experimenter is forced to be more explicit about what those parameters are.
           (C) Once the parameters are optimized, re-run training on the larger set T and measure the accuracy on subset k.
           (D) Overall accuracy is averaged across all k partitions. The variance can also be estimated from the error rates of the k partitions.

This is not stated in the paper, but more reliable estimates can be obtained by replicating the cross-validation n times. R code that illustrates the experimentation strategy is provided here. This code provides a basic scenario on the Iris data set. Suppose we would like to train a random forest and we are interested in finding the best setting for the number of features to be tried at each split (the mtry parameter of the randomForest function). The number of trees is fixed to 50. There are 4 features in Iris, and the factors 0.2, 0.4 and 0.6 are tried (which makes 1, 2 and 3 features, respectively).
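
The steps (A) to (D) can be sketched in R roughly as follows. This is a minimal illustration written for this post's scenario, not the linked code itself: the number of folds (k = 5), the 2/3 versus 1/3 split of T into T1 and T2, the random seed, and the use of ceiling() to turn the factors into 1, 2 and 3 features are all assumptions, and rf_error is a hypothetical helper name.

```r
library(randomForest)

set.seed(1)                          # assumed seed, only for reproducibility
data(iris)
k       <- 5                         # number of cross-validation folds (assumed)
factors <- c(0.2, 0.4, 0.6)          # fractions of the features to try at each split
nfeat   <- ncol(iris) - 1            # 4 predictors in Iris
mtrys   <- ceiling(factors * nfeat)  # rounding up gives 1, 2 and 3 features (assumed rule)
ntree   <- 50                        # number of trees is fixed

# hypothetical helper: error rate of a forest trained on 'train', evaluated on 'test'
rf_error <- function(train, test, mtry) {
  fit <- randomForest(Species ~ ., data = train, ntree = ntree, mtry = mtry)
  mean(predict(fit, test) != test$Species)
}

fold     <- sample(rep(1:k, length.out = nrow(iris)))  # random fold assignment
cv_error <- numeric(k)

for (i in 1:k) {
  T_all <- iris[fold != i, ]         # (A) training set T = D minus fold i
  held  <- iris[fold == i, ]         # held-out fold i

  # (B) split T into T1 (training) and T2 (tuning); the 2/3 share is an assumption
  in_T1 <- sample(nrow(T_all)) <= round(2 * nrow(T_all) / 3)
  T1 <- T_all[in_T1, ]
  T2 <- T_all[!in_T1, ]
  tune_error <- sapply(mtrys, function(m) rf_error(T1, T2, m))
  best_mtry  <- mtrys[which.min(tune_error)]

  # (C) retrain on all of T with the selected mtry and test on fold i
  cv_error[i] <- rf_error(T_all, held, best_mtry)
}

# (D) average error rate and its variance across the k folds
mean(cv_error)
var(cv_error)
```

Wrapping the loop in an outer loop over n replications (with a fresh fold assignment each time) would give the replicated estimate mentioned above.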

[a] Salzberg, S. L. (1997). On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery, 1(3).


Performance improvements for TSBF

Thursday, 19 July 2012 01:28

While running TSBF on the new data from the UCR database for our revisions to the paper, I realized that the current R implementation is not efficient. The overall approach is still not implemented well, since feature extraction is done separately (in C code) and is connected to R through text files. This significantly affects the running time of TSBF, since reading the files into matrices in R takes substantial time (especially for large data sets). To shorten the time for reading the feature matrices and to handle memory efficiently, I made the following revisions:

1)  Removing matrices that are no longer used (memory management), illustrated below for some of the matrices.

rm(subtr)
rm(subtst)
gc(verbose=TRUE)
 
2)  Reading the subsequence features into a matrix using scan (improves both memory usage and computation time).
 
Before (read.table reads into a data.frame, which is not memory-efficient when the data is numeric; using a matrix instead improves both the memory usage and the time to read):
#read generated features
subtr<- read.table("RFsub_train")
subtst<- read.table("RFsub_test")
 
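After (two lines of code were added to the C implementation so that the number of subsequences per time series and the number of columns of the feature matrix are known):
#read subsequence data information and generated features
#noftrain and noftest are set earlier in the script
stats <- scan("stats", n=2, quiet=TRUE) #[1] number of subsequences per time series [2] number of features
nsub <- stats[1]*noftrain      #number of rows of the training feature matrix
nfeat <- stats[2]              #number of columns of the feature matrices
nsubtest <- stats[1]*noftest   #number of rows of the test feature matrix
#scan returns the values in file (row) order, hence byrow=TRUE
subtr <- matrix(scan("RFsub_train", what=numeric(0), n=nsub*nfeat, quiet=TRUE), nsub, nfeat, byrow=TRUE)
subtst <- matrix(scan("RFsub_test", what=numeric(0), n=nsubtest*nfeat, quiet=TRUE), nsubtest, nfeat, byrow=TRUE)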
 
Performance with and without scan on a Windows 7 system with an i5 2.13 GHz processor (feature matrix for the subsequence features of the CinC_ECG_torso data set, matrix size: 407100 x 102):
system.time(subtst <- read.table("RFsub_test"))
   user  system elapsed
1141.50    6.89 1169.18
system.time(matrix(scan("RFsub_test",what=numeric(0),n=nsubtest*nfeat,quiet=TRUE),nsubtest,nfeat,byrow=TRUE))
   user  system elapsed
 116.93    2.48  121.59
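
The snippets above depend on the feature files written by the C code. For readers who want to try the two reading strategies without those files, the following is a small self-contained sketch on a temporary file; the matrix size, the file contents and the variable names are made up for illustration, and timings on such a small file will not resemble the ones reported above.

```r
# write a small numeric matrix to a temporary text file (stand-in for RFsub_test)
nrows <- 1000
ncols <- 10
m <- matrix(round(runif(nrows * ncols), 4), nrows, ncols)
path <- tempfile()
write.table(m, path, row.names = FALSE, col.names = FALSE)

# data.frame route: read.table has to work out the column types itself
df <- read.table(path)

# matrix route: scan reads a known number of numeric values in file (row) order
mat <- matrix(scan(path, what = numeric(0), n = nrows * ncols, quiet = TRUE),
              nrows, ncols, byrow = TRUE)

# both routes hold the same values
same <- isTRUE(all.equal(as.matrix(df), mat, check.attributes = FALSE))
same

unlink(path)  # remove the temporary file
```

Wrapping either read in system.time(), as done above, shows the difference once the file is large enough.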
 
Please let me know if you have any questions! The direct link to the folder for the updated files is here.
 


Copyright © 2014 mustafa gokce baydogan
