Scalable and fault tolerant failure detection and consensus

Katti, Amogh; Di Fatta, Giuseppe; Naughton, Thomas; Engelmann, Christian

Katti, A., Di Fatta, G., Naughton, T. and Engelmann, C. (2015) Scalable and fault tolerant failure detection and consensus. In: The 22nd European MPI Users' Group Meeting (EuroMPI '15), 21-23 September 2015, Bordeaux, France, Article No. 13. doi: 10.1145/2802658.2802660 (ISBN 9781450337953)

Abstract/Summary

Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.
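
The logarithmic scaling claim can be illustrated with a toy simulation. The sketch below is not either of the paper's two algorithms. It assumes a simple synchronous push-gossip model in which each alive process sends its current list of known failed ranks to one uniformly random alive peer per cycle, and it counts the cycles until every alive process holds the full list. The function name `gossip_cycles_to_consensus`, its parameters, and the choice of one random initial detector per failure are all hypothetical and used here only for illustration.

```python
import random


def gossip_cycles_to_consensus(n, failed, seed=0):
    """Count push-gossip cycles until all alive processes know every failed rank.

    Assumptions (illustrative only, not taken from the paper):
    - cycles are synchronous,
    - each failure is first detected by one random alive process,
    - each alive process pushes to one random alive peer per cycle.
    """
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]
    if not alive:
        raise ValueError("at least one process must be alive")

    # Every alive process starts with an empty list of known failures.
    known = {p: set() for p in alive}
    for f in failed:
        known[rng.choice(alive)].add(f)

    cycles = 0
    while any(known[p] != failed for p in alive):
        # Snapshot first so that information travels one hop per cycle.
        snapshot = {p: set(s) for p, s in known.items()}
        for p in alive:
            peer = rng.choice(alive)
            known[peer] |= snapshot[p]
        cycles += 1
    return cycles


if __name__ == "__main__":
    for size in (64, 256, 1024):
        print(size, gossip_cycles_to_consensus(size, failed={3}))
```

Under this model the set of informed processes can at most double per cycle, so the cycle count cannot fall below the base-2 logarithm of the number of alive processes. Running the script for increasing sizes gives a rough feel for how slowly the count grows with system size.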

Item Type	Conference or Workshop Item (Paper)
URI	https://centaur.reading.ac.uk/id/eprint/50881
Identification Number/DOI	10.1145/2802658.2802660
Refereed	Yes
Divisions	Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science

Date Deposited:	18 Jan 2016 13:45
Last Modified:	20 Jan 2026 16:15