The Fault Tolerant Parallel Algorithm: the Parallel Recomputing Based Failure Recovery

Xuejun Yang, Yunfei Du, Panfeng Wang, Hongyi Fu, Jia Jia, Zhiyuan Wang, Guang Suo
National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology


As large-scale computer systems grow, their mean time between failures (MTBF) is becoming significantly shorter than the execution time of many current computational science programs, so these programs must tolerate failures. Checkpoint-based methods, currently used on most machines, periodically save the state of a computation to stable storage and roll back all processes to the last checkpoint upon a failure. However, these methods waste a significant amount of computation, because all processes have to redo all the computation from that checkpoint onwards. In addition, the fault recovery time is bounded by the time between the last checkpoint and the crash. This paper addresses fault tolerance in parallel computing and proposes a new method named parallel recomputing, which achieves fault recovery automatically by using the surviving processes to recompute the workload of failed processes in parallel. The paper first defines the fault tolerant parallel algorithm (FTPA) as a parallel algorithm that tolerates failures by parallel recomputing. It then proposes an inter-process definition-use relationship analysis method, based on conventional definition-use analysis, for revealing the relationships among variables in different processes. Guided by this method, principles for designing fault tolerant parallel algorithms are given. Finally, the authors present the design of FTPAs for matrix-matrix multiplication and NPB kernels and evaluate them experimentally on a cluster system. The experimental results show that the overhead of FTPA is less than that of checkpointing.
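To make the idea of parallel recomputing concrete, the following is a minimal sequential sketch in Python for a row-block matrix-matrix multiplication C = A*B. It is an illustration under stated assumptions, not the authors' FTPA design: the function names, the row-block ownership map `owned_rows`, and the assumption that every surviving process can still access the inputs A and B are all hypothetical. The loop over survivors stands in for work that would run concurrently on separate processes.

```python
def matmul_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def split_evenly(indices, parts):
    """Split a list of indices into `parts` nearly equal contiguous chunks."""
    q, r = divmod(len(indices), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = q + (1 if i < r else 0)
        chunks.append(indices[start:start + size])
        start += size
    return chunks


def recover_by_parallel_recomputing(a, b, owned_rows, failed_rank):
    """Recompute the failed rank's rows of C using the surviving ranks.

    Assumption (hypothetical): A and B are still available to survivors,
    e.g. because the inputs are replicated or can be reloaded.
    Each survivor takes one slice of the lost rows; in a real system
    the slices would be computed concurrently, one per process.
    """
    survivors = [r for r in sorted(owned_rows) if r != failed_rank]
    lost = owned_rows[failed_rank]
    recovered = {}
    for _rank, chunk in zip(survivors, split_evenly(lost, len(survivors))):
        block = matmul_rows([a[i] for i in chunk], b)
        for i, row in zip(chunk, block):
            recovered[i] = row
    # Return the lost rows of C in their original order.
    return [recovered[i] for i in lost]


if __name__ == "__main__":
    a = [[i + j for j in range(3)] for i in range(12)]
    b = [[1, 2], [3, 4], [5, 6]]
    # Four ranks, each owning three consecutive rows of C.
    owned = {r: list(range(3 * r, 3 * r + 3)) for r in range(4)}
    print(recover_by_parallel_recomputing(a, b, owned, failed_rank=2))
```

In this sketch only the failed process's share of the work is redone, and it is divided among the survivors, whereas a checkpoint rollback would have every process repeat its own work since the last checkpoint.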