Representing a large data set as a histogram is a good way to summarize and visualize it, and is frequently used to optimize query estimation in database management systems (DBMS). In this paper, we show the performance and properties of two strategies for the optimal construction of a histogram on a single real-valued descriptor, given a number of buckets chosen in advance. The first is based on the Fisher algorithm; the second is based on a geometrical procedure that interpolates the empirical distribution function with a piecewise linear function. Goodness of fit is computed using the Wasserstein metric between distributions. We compare the performance of the proposed methods against some existing ones on artificial and real data sets.
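To illustrate the first strategy, the following is a minimal sketch of a Fisher-style dynamic program that splits sorted values into a fixed number of contiguous buckets. It assumes the classical within-bucket sum of squared deviations as the cost; the abstract does not state the exact criterion used in the paper, so this is an assumption, and the function name `fisher_partition` is hypothetical.

```python
from typing import List, Tuple


def fisher_partition(values: List[float], k: int) -> Tuple[List[List[float]], float]:
    """Split the sorted values into k contiguous buckets.

    Assumed cost (not confirmed by the abstract): total within-bucket
    sum of squared deviations from the bucket mean.
    Returns the buckets and the minimal total cost.
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums of x and x^2 give the cost of any bucket in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i: int, j: int) -> float:
        """Sum of squared deviations of xs[i:j] from its mean (j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[g][j]: best cost of splitting xs[:j] into g buckets.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    # cut[g][j]: start index of the last bucket in that best split.
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0

    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = cost[g - 1][i] + sse(i, j)
                if c < cost[g][j]:
                    cost[g][j] = c
                    cut[g][j] = i

    # Walk the stored cut points backwards to recover the buckets.
    buckets: List[List[float]] = []
    j = n
    for g in range(k, 0, -1):
        i = cut[g][j]
        buckets.append(xs[i:j])
        j = i
    buckets.reverse()
    return buckets, cost[k][n]


buckets, total_cost = fisher_partition([1, 2, 3, 10, 11, 12, 20, 21, 22], 3)
```

This sketch runs in O(k n²) time, which is fine for illustration but would need a faster variant or pre-binning for genuinely large data sets.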
|Title:||Optimal histogram representation of large data sets: Fisher vs piecewise linear approximations|
|Internal authors:||IRPINO, Antonio|
|Publication date:||2007|
|Journal:||REVUE DES NOUVELLES TECHNOLOGIES DE L'INFORMATION|
|Appears in types:||1.1 Journal article|