Dynamic load balancing in parallel KD-tree k-means

Di Fatta, Giuseppe; Pettinger, David

Download

[thumbnail of 2010_DiFatta10-IEEE-ScalCom.pdf]

Preview

Text
- Accepted Version

Advice

Please see our End User Agreement.

It is advisable to refer to the publisher's version if you intend to cite from this work. See Guidance on citing.

Tools

Lists

Di Fatta, G. and Pettinger, D. (2010) Dynamic load balancing in parallel KD-tree k-means. In: CIT '10: Proceedings of the 10th IEEE International Conference on Computer and Information Technology. IEEE, Washington DC, pp. 2478-2485. ISBN 978-0-7695-4108-2 doi: 10.1109/CIT.2010.424

Abstract/Summary

One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.

Altmetric Badge

Dimensions Badge

Item Type	Book or Report Section
URI	https://centaur.reading.ac.uk/id/eprint/6127
Identification Number/DOI	10.1109/CIT.2010.424
Refereed	Yes
Divisions	Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
Uncontrolled Keywords	clustering, dynamic load balancing, KD Trees, parallel k-means Clustering;
Publisher	IEEE
Publisher Statement	(c) 2010 IEEE
Download/View statistics	View download statistics for this item

Download Statistics

Downloads

Downloads per month over past year

Related URLs

Deposit Details

References

[1] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. L. Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281–297. [2] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982. [3] A. W. Moore, “Efficient memory based learning for robot control,” PhD thesis, 1991. [4] K. Alsabti, S. Ranka, and V. Singh, “An efficient K-Means clustering algorithm,” Proceedings of First Workshop on High Performance Data Mining, 1998. [5] D. Pelleg and A. Moore, “Accelerating exact K-Means algorithms with geometric reasoning,” Proceedings of Fifth ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, pp. 277–281, Jul. 1999. [6] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient K-Means clustering algorithm: Analysis and implementation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881–892, Jul. 2002. [7] I. S. Dhillon and D. S. Modha, “A data-clustering algorithm on distributed memory multiprocessors,” In Large-Scale Parallel Data Mining, Lecture Notes in Computer Science, vol. 1759, pp. 245–260, Mar. 2000. [8] A. G¨ursoy, “Data decomposition for K-Means parallel clustering,” in Proceedings of the 5th International Conference on Parallel Processing and Applied Mathematics (PPAM), 2003, pp. 241–248. [9] M. Inaba, N. Katoh, and H. Imai, “Applications of weighted Voronoi diagrams and randomization to variance-based kclustering,” in SCG ’94: Proceedings of the tenth annual symposium on Computational geometry. NewYork,NY, USA: ACM, 1994, pp. 332–339. [10] D. Judd, P. K. McKinley, and A. K. Jain, “Large-scale parallel data clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 871–876, Aug. 1998. [11] S. Har-Peled and B. Sadri, “How fast is the k-means method?” in SODA ’05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2005, pp. 877–885. [12] D. Arthur and S. Vassilvitskii, “How slow is the k-means method?” in SCG ’06: Proceedings of the twenty-second annual symposium on Computational geometry. NewYork, NY, USA: ACM, 2006, pp. 144–153. [13] P. S. Bradley and U. M. Fayyad, “Refining initial points for KMeans clustering,” Proceedings of Fifteenth Intl. Conference on Machine Learning, pp. 91–99, 1998. [14] D. Pelleg and A. Moore, “X-Means: Extending K-Means with efficient estimation of the number of clusters,” Proceedings of the Seventeenth Intl. Conference on Machine Learning, pp. 727–734, 2000. [15] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975. [16] T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: an efficient data clustering method for very large databases,” SIGMOD Rec., vol. 25, no. 2, pp. 103–114, 1996. [17] D. Foti, D. Lipari, C. Pizzuti, and D. Talia, “Scalable parallel clustering for data mining on multicomputers,” Proceedings of the 15th IPDPS 2000 Workshops on Parallel and Distributed Processing, pp. 390–398, 2000. [18] I. Al-furaih, S. Aluru, S. Goil, and S. Ranka, “Parallel construction of multidimensional binary search trees,” IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 2, pp. 136–148, Feb. 2000. [19] J.B. Kruskal and M. Wish, Multidimensional Scaling. Sage University Paper series on Quantitative Applications in the Social Sciences, Beverly Hills and London: Sage Publications, 07-011, 1978. [20] M. Baker, B. Carpenter, and A. Shafi, “MPJ Express: Towards thread safe Java HPC,” Proceedings of the IEEE International Conference on Cluster Computing, Sep. 2006.

CORE (COnnecting REpositories)

University Staff: Request a correction | Centaur Editors: Update this record

Date Deposited:	23 Jul 2010 13:38	Date item deposited into CentAUR
Last Modified:	20 Jan 2026 12:30	Date item last modified