Intelligent agents for fault tolerance: from multi-agent simulation to cluster-based implementation
Varghese, B., McKee, G. and Alexandrov, V. (2010) Intelligent agents for fault tolerance: from multi-agent simulation to cluster-based implementation. In: 2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops (WAINA). IEEE, pp. 985-990. ISBN 9781424467013
Full text not archived in this repository.
To link to this article, use the DOI: 10.1109/WAINA.2010.21
Recent research in multi-agent systems incorporates fault tolerance concepts, but does not explore the extension and implementation of such ideas for large-scale parallel computing systems. The work reported in this paper investigates a swarm array computing approach, namely 'Intelligent Agents'. A task to be executed on a parallel computing system is decomposed into sub-tasks, which are mapped onto agents that traverse an abstracted hardware layer. The agents communicate across processors to share information in the event of a predicted core or processor failure and to complete the task successfully. The feasibility of the approach is validated by simulations on an FPGA using a multi-agent simulator, and by an implementation of a parallel reduction algorithm on a computer cluster using the Message Passing Interface.
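The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it is a single-process Python simulation rather than an MPI program, and the names `Agent`, `decompose`, `migrate`, `tree_reduce` and `run` are hypothetical. It also assumes that the set of cores predicted to fail is given as input, and that each core hosts at most one agent. A sum reduction is decomposed into sub-tasks carried by agents, any agent on a core predicted to fail moves to the nearest free healthy core, and the partial results are then combined pairwise.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical agent carrying one sub-task (a slice of the input)."""
    agent_id: int
    data: list
    core: int  # index of the core the agent currently occupies


def decompose(values, n_agents):
    """Split values into contiguous chunks, one per agent; agent i starts on core i."""
    size = -(-len(values) // n_agents)  # ceiling division
    return [Agent(i, values[i * size:(i + 1) * size], core=i)
            for i in range(n_agents)]


def migrate(agent, failing, occupied, n_cores):
    """Move an agent off a core predicted to fail onto the nearest free healthy core."""
    if agent.core not in failing:
        return
    for offset in range(1, n_cores):
        candidate = (agent.core + offset) % n_cores
        if candidate not in failing and candidate not in occupied:
            occupied.discard(agent.core)
            occupied.add(candidate)
            agent.core = candidate
            return
    raise RuntimeError("no healthy spare core available")


def tree_reduce(partials):
    """Combine partial sums pairwise, as in a tree-structured parallel reduction."""
    while len(partials) > 1:
        paired = [partials[i] + partials[i + 1]
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:  # carry an unpaired element to the next round
            paired.append(partials[-1])
        partials = paired
    return partials[0]


def run(values, n_agents, n_cores, failing):
    """Assumes n_agents <= n_cores; `failing` is the assumed failure prediction."""
    agents = decompose(values, n_agents)
    occupied = {a.core for a in agents}
    for a in agents:
        migrate(a, failing, occupied, n_cores)
    partials = [sum(a.data) for a in agents]
    return tree_reduce(partials), [a.core for a in agents]


if __name__ == "__main__":
    total, cores = run(list(range(1, 17)), n_agents=4, n_cores=6, failing={1})
    print(total, cores)  # 136 [0, 4, 2, 3]
```

In this run the agent on core 1 relocates to core 4, the first healthy core not already occupied, and the reduction still yields the full sum. In a cluster setting the relocation step would instead involve transferring the agent's data between processes, which this sketch does not model.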