Space partitioning for scalable K-means
Pettinger, D. and Di Fatta, G. (2010) Space partitioning for scalable K-means. In: Draghici, S., Khoshgoftaar, T. M., Palade, V., Pedrycz, W., Wani, M. A. and Zhu, X. (eds.) The Ninth International Conference on Machine Learning and Applications (ICMLA 2010). IEEE Press, pp. 319-324. ISBN 9781424492114
Full text not archived in this repository.
To link to this article DOI: 10.1109/ICMLA.2010.46
K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints allow to reduce the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) in a wide range of the parameters space. BSP-KM is more scalable than KDKM, while keeping the deterministic nature of the `brute force' algorithm. In particular, the proposed space partitioning approach has shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.