Accessibility navigation

Non-uniform data distribution for communication-efficient parallel clustering

Goodall, T., Pettinger, D. and Di Fatta, G. (2013) Non-uniform data distribution for communication-efficient parallel clustering. Journal of Computational Science, 4 (6). pp. 489-495. ISSN 1877-7503

Full text not archived in this repository.

It is advisable to refer to the publisher's version if you intend to cite from this work. See Guidance on citing.

To link to this item DOI: 10.1016/j.jocs.2013.01.007


Global communicationrequirements andloadimbalanceof someparalleldataminingalgorithms arethe major obstacles to exploitthe computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication costin parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operationwhichhinders thescalabilityoftheapproach.Thisworkstudiesadifferentparallelformulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature ofthe centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means ofmulti-dimensional binary searchtrees. The approachcanalso be extended to accommodate an approximation error which allows a further reduction ofthe communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing element

Item Type:Article
Divisions:Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
ID Code:32253
Uncontrolled Keywords:Parallel data mining; Clustering; k-Means; Group communication; Extreme-scale computing

University Staff: Request a correction | Centaur Editors: Update this record

Page navigation