Dynamic load balancing in parallel KD-tree k-means
Di Fatta, G. and Pettinger, D. (2010) Dynamic load balancing in parallel KD-tree k-means. In: CIT '10: Proceedings of the 10th IEEE International Conference on Computer and Information Technology. IEEE, Washington DC, pp. 2478-2485. ISBN 978-0-7695-4108-2
To link to this article DOI: 10.1109/CIT.2010.424
One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.
 J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. L. Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281–297.  S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.  A. W. Moore, “Efficient memory based learning for robot control,” PhD thesis, 1991.  K. Alsabti, S. Ranka, and V. Singh, “An efficient K-Means clustering algorithm,” Proceedings of First Workshop on High Performance Data Mining, 1998.  D. Pelleg and A. Moore, “Accelerating exact K-Means algorithms with geometric reasoning,” Proceedings of Fifth ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, pp. 277–281, Jul. 1999.  T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient K-Means clustering algorithm: Analysis and implementation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881–892, Jul. 2002.  I. S. Dhillon and D. S. Modha, “A data-clustering algorithm on distributed memory multiprocessors,” In Large-Scale Parallel Data Mining, Lecture Notes in Computer Science, vol. 1759, pp. 245–260, Mar. 2000.  A. G¨ursoy, “Data decomposition for K-Means parallel clustering,” in Proceedings of the 5th International Conference on Parallel Processing and Applied Mathematics (PPAM), 2003, pp. 241–248.  M. Inaba, N. Katoh, and H. Imai, “Applications of weighted Voronoi diagrams and randomization to variance-based kclustering,” in SCG ’94: Proceedings of the tenth annual symposium on Computational geometry. NewYork,NY, USA: ACM, 1994, pp. 332–339.  D. Judd, P. K. McKinley, and A. K. Jain, “Large-scale parallel data clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 871–876, Aug. 1998.  S. Har-Peled and B. Sadri, “How fast is the k-means method?” in SODA ’05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2005, pp. 877–885.  D. Arthur and S. Vassilvitskii, “How slow is the k-means method?” in SCG ’06: Proceedings of the twenty-second annual symposium on Computational geometry. NewYork, NY, USA: ACM, 2006, pp. 144–153.  P. S. Bradley and U. M. Fayyad, “Refining initial points for KMeans clustering,” Proceedings of Fifteenth Intl. Conference on Machine Learning, pp. 91–99, 1998.  D. Pelleg and A. Moore, “X-Means: Extending K-Means with efficient estimation of the number of clusters,” Proceedings of the Seventeenth Intl. Conference on Machine Learning, pp. 727–734, 2000.  J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.  T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: an efficient data clustering method for very large databases,” SIGMOD Rec., vol. 25, no. 2, pp. 103–114, 1996.  D. Foti, D. Lipari, C. Pizzuti, and D. Talia, “Scalable parallel clustering for data mining on multicomputers,” Proceedings of the 15th IPDPS 2000 Workshops on Parallel and Distributed Processing, pp. 390–398, 2000.  I. Al-furaih, S. Aluru, S. Goil, and S. Ranka, “Parallel construction of multidimensional binary search trees,” IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 2, pp. 136–148, Feb. 2000.  J.B. Kruskal and M. Wish, Multidimensional Scaling. Sage University Paper series on Quantitative Applications in the Social Sciences, Beverly Hills and London: Sage Publications, 07-011, 1978.  M. Baker, B. Carpenter, and A. Shafi, “MPJ Express: Towards thread safe Java HPC,” Proceedings of the IEEE International Conference on Cluster Computing, Sep. 2006.