Characteristic-based clustering for time series data

Published on 16 May 2006 in Refereed papers

Data Mining and Knowledge Discovery, 13(3), 335–364.

Xiaozhe Wang (1), Kate A. Smith (1) and Rob J. Hyndman (2)

  1. Faculty of Information Technology, Monash University, Clayton VIC 3800, Australia.
  2. Department of Econometrics and Business Statistics, Monash University, VIC 3800, Australia.

Abstract: With the growing importance of time series clustering research, particularly for similarity searches amongst long time series such as those arising in medicine or finance, it is critical to resolve the outstanding problems that make most clustering methods impractical under certain circumstances. When the time series is very long, some clustering algorithms may fail because the very notion of similarity is dubious in high-dimensional space; in addition, many methods cannot handle missing data when the clustering is based on a distance metric. This paper proposes a method for clustering time series based on their structural characteristics. Unlike other alternatives, this method does not cluster point values using a distance metric; rather, it clusters based on global features extracted from the time series. The feature measures are obtained from each individual series and can be fed into arbitrary clustering algorithms, including an unsupervised neural network algorithm (the self-organizing map) or a hierarchical clustering algorithm. Global measures describing the time series are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity. Since the method clusters using extracted global measures, it reduces the dimensionality of the time series and is much less sensitive to missing or noisy data. We further provide a search mechanism to find the best selection from the feature set that should be used as the clustering inputs. The proposed technique has been tested using benchmark time series datasets previously reported for time series clustering and a set of time series datasets with known characteristics. The empirical results show that our approach is able to yield meaningful clusters. The resulting clusters are similar to those produced by other methods, but with some promising and interesting variations that can be intuitively explained with knowledge of the global characteristics of the time series.
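
The idea of clustering on global measures rather than on point values can be illustrated with a minimal Python sketch. This is not the authors' implementation: it computes only four simplified stand-ins for the measures listed in the abstract (trend, serial correlation, skewness, kurtosis), and the function names `global_features`, `standardize` and `average_linkage` are hypothetical. A small average-linkage hierarchical clustering is used as one possible choice of clustering algorithm.

```python
import math
import random


def global_features(series):
    """Map one series to a fixed-length vector of global measures.

    Simplified, illustrative measures (assumes a non-constant series):
    trend strength, lag-1 serial correlation, skewness, excess kurtosis.
    """
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    ss = sum(d * d for d in dev)  # sum of squared deviations
    var = ss / n
    # Trend: squared correlation between the series and the time index.
    t_mean = (n - 1) / 2
    cov_tx = sum((t - t_mean) * d for t, d in enumerate(dev))
    ss_t = sum((t - t_mean) ** 2 for t in range(n))
    trend = cov_tx ** 2 / (ss_t * ss)
    # Serial correlation at lag 1.
    serial = sum(dev[i] * dev[i + 1] for i in range(n - 1)) / ss
    skew = sum(d ** 3 for d in dev) / n / var ** 1.5
    kurt = sum(d ** 4 for d in dev) / n / var ** 2 - 3.0
    return [trend, serial, skew, kurt]


def standardize(rows):
    """Z-score each feature column so that no single measure dominates."""
    scaled_cols = []
    for col in zip(*rows):
        m = sum(col) / len(col)
        s = math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        scaled_cols.append([(v - m) / s for v in col])
    return [list(r) for r in zip(*scaled_cols)]


def average_linkage(points, k):
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        _, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] += clusters.pop(j)  # j > i, so index i is unaffected
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


# Synthetic demo data (an assumption, not from the paper): three trending
# series and three pure-noise series, deliberately of different lengths.
rng = random.Random(0)
lengths = [120, 180, 240]
trending = [[0.05 * t + rng.gauss(0, 1) for t in range(n)] for n in lengths]
noise = [[rng.gauss(0, 1) for _ in range(n)] for n in lengths]
series_set = trending + noise

features = [global_features(s) for s in series_set]
labels = average_linkage(standardize(features), k=2)
print(labels)
```

Because every series is reduced to the same four numbers, series of different lengths can be compared directly, which is the dimensionality reduction the abstract refers to. Any other clustering algorithm could be substituted for `average_linkage` without changing the feature extraction step.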

Keywords: time series clustering, clustering, global characteristics, feature measures, dimensionality reduction.
