Performance improvements for TSBF

While running TSBF on the new datasets from the UCR database for the revisions to our paper, I realized that the current R implementation is not efficient. The overall design is still not ideal: feature extraction is done separately in C code, and the results are passed to R through text files. This significantly increases the running time of TSBF, because reading the files into matrices in R takes substantial time, especially for large datasets. To shorten the time needed to read the feature matrices and to use memory more efficiently, I made the following revisions:

1)  Removing matrices that are no longer used (memory management), illustrated below for some of the matrices:

rm(subtr)
rm(subtst)
gc(verbose=TRUE)
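
As a minimal sketch of the same idea on synthetic data, the snippet below creates a large matrix, measures its size, and then frees it. The name feat_tmp and the matrix dimensions are hypothetical and are not part of the TSBF code.

```r
# Hypothetical stand-in for a large feature matrix (not the real TSBF data)
feat_tmp <- matrix(runif(1e6), nrow = 1e4, ncol = 100)

# Size of the object in megabytes (one million doubles at 8 bytes each)
size_mb <- as.numeric(object.size(feat_tmp)) / 1024^2

# Drop the object once it is no longer needed, then run garbage collection
# so the freed memory can be reused or returned to the operating system
rm(feat_tmp)
invisible(gc())

# The name is no longer bound in the current environment
still_there <- exists("feat_tmp", inherits = FALSE)
```

Calling rm only removes the binding; the memory is reclaimed at the next garbage collection, which the explicit gc() call triggers immediately.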
 
2)  Reading subsequence features into a matrix using scan (improves both memory usage and computation time).
 
Before (read.table reads into a data.frame, which is not memory-efficient when the data is numeric; using a matrix instead improves both memory usage and read time):
#read generated features
subtr<- read.table("RFsub_train")
subtst<- read.table("RFsub_test")
 
After (I added two lines of code to the C implementation so that we know the number of subsequences per time series and the number of columns of the feature matrix):
#read subsequence data information and generated features
stats<-scan("stats",n=2,quiet=TRUE) #[1] number of subsequences [2]  number of features
nsub=stats[1]*noftrain
nfeat=stats[2]
nsubtest=stats[1]*noftest
subtr<-matrix(scan("RFsub_train",what=numeric(0),n=nsub*nfeat,quiet=TRUE),nsub,nfeat,byrow=TRUE)
subtst<-matrix(scan("RFsub_test",what=numeric(0),n=nsubtest*nfeat,quiet=TRUE),nsubtest,nfeat,byrow=TRUE)
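
The scan-based read can be tried end to end on a small synthetic file. The sketch below assumes the feature file stores one matrix row per text line (which is what byrow=TRUE in the code above implies) and that the stats file holds the two counts on one line. The sizes and temporary file names are hypothetical.

```r
# Hypothetical sizes, chosen only for illustration
n_series <- 3        # plays the role of noftrain
sub_per_series <- 2  # subsequences per time series
n_feat <- 4          # number of feature columns

expected <- matrix(as.numeric(seq_len(n_series * sub_per_series * n_feat)),
                   nrow = n_series * sub_per_series, ncol = n_feat, byrow = TRUE)

feat_file <- tempfile()
stats_file <- tempfile()
# Assumed layout: one matrix row per text line, values separated by spaces
write.table(expected, feat_file, row.names = FALSE, col.names = FALSE)
writeLines(paste(sub_per_series, n_feat), stats_file)

# Read the two counts first, then size the matrix from them
stats <- scan(stats_file, n = 2, quiet = TRUE)
nsub <- stats[1] * n_series
nfeat <- stats[2]

# scan returns the values in file order (row by row), so byrow = TRUE is needed
feat <- matrix(scan(feat_file, what = numeric(0), n = nsub * nfeat, quiet = TRUE),
               nsub, nfeat, byrow = TRUE)

# Without byrow = TRUE the default column-major fill scrambles the rows
wrong <- matrix(scan(feat_file, what = numeric(0), n = nsub * nfeat, quiet = TRUE),
                nsub, nfeat)

unlink(c(feat_file, stats_file))
```

Passing n to scan tells it how many values to expect, which is the reason the C code has to write the two counts to the stats file.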
 
Performance with and without scan on a Windows 7 system with an i5 2.13 GHz processor (feature matrix for the subsequence features of the CinC_ECG_torso dataset, matrix size: 407100 x 102):
system.time(subtst<- read.table("RFsub_test"))
   user  system elapsed
1141.50    6.89 1169.18
system.time(matrix(scan("RFsub_test",what=numeric(0),n=nsubtest*nfeat,quiet=TRUE),nsubtest,nfeat,byrow=TRUE))
   user  system elapsed
 116.93    2.48  121.59
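
To repeat this kind of comparison on another machine, a sketch like the following can be used. It runs on a small synthetic file, so the absolute timings will not match the ones above. The third variant, read.table with colClasses and nrows supplied, is not part of the TSBF code and is shown only as an assumption-labeled alternative that is sometimes suggested for numeric files.

```r
set.seed(1)
n_row <- 2000  # hypothetical, far smaller than 407100 x 102
n_col <- 50
x <- matrix(round(runif(n_row * n_col), 4), n_row, n_col)

f <- tempfile()
write.table(x, f, row.names = FALSE, col.names = FALSE)

# Plain read.table, as in the "before" code
t_table <- system.time(df <- read.table(f))

# read.table with the column type and row count given up front
# (an alternative that is not used in TSBF)
t_typed <- system.time(
  df2 <- read.table(f, colClasses = "numeric", nrows = n_row, comment.char = "")
)

# scan into a matrix, as in the "after" code
t_scan <- system.time(
  m <- matrix(scan(f, what = numeric(0), n = n_row * n_col, quiet = TRUE),
              n_row, n_col, byrow = TRUE)
)

# Elapsed seconds for the three approaches
elapsed <- c(read.table = t_table[["elapsed"]],
             read.table.typed = t_typed[["elapsed"]],
             scan = t_scan[["elapsed"]])
print(elapsed)

# All three approaches should yield the same numbers
same <- isTRUE(all.equal(unname(as.matrix(df)), m)) &&
        isTRUE(all.equal(unname(as.matrix(df2)), m))

unlink(f)
```

On a file this small the differences may be negligible; the gap reported above was measured on a matrix with more than 40 million entries.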
 
Please let me know if you have any questions! The direct link to the folder for the updated files is here.

Copyright © 2014 mustafa gokce baydogan
