Chuanqing Shi; Shengfa Gao

A Checkpointing Algorithm Based Unreliable Non-FIFO Channels

Full Text (PDF, 372KB), PP.35-48

Views: 0 Downloads: 0

Author(s)

Chuanqing Shi ^1,* Shengfa Gao ¹

1. College of Computer Science and Technology, ShanDong University, Jinan Shandong

* Corresponding author.

DOI: https://doi.org/10.5815/ijem.2012.04.05

Received: 16 May 2012 / Revised: 13 Jun. 2012 / Accepted: 18 Jul. 2012 / Published: 29 Aug. 2012

Index Terms

Unreliable non-FIFO channel, message losses, message duplicate, message reordering, consistent global checkpoints

Abstract

We propose a coordinated checkpointing algorithm based unreliable non-FIFO channel. In unreliable non-FIFO channel, the system can lose, duplicate, or reorder messages. The processes may not compute some messages because of message losses; the processes may compute some messages twice or more because of message duplicate; the processes may not compute messages according to their sending order because of message reordering. The above-mentioned problems make processes produce incorrect computation result, consequently, prevent processes from taking consistent global checkpoints. Our algorithm assigns each message a sequence number in order to resolve above-mentioned problems. During the establishing of the checkpoint, the consistency of checkpoint can be determined by the sequence number of sending and receiving messages. We can identify the lost messages, reordering messages and duplicate messages by checking the sequence number of sending and receiving messages. We resolve above-mentioned problems by resending the lost messages, buffering the reordering messages and dropping the duplicate messages. Our algorithm makes processes take consistent global checkpoints.

Cite This Paper

Chuanqing Shi,Shengfa Gao,"A Checkpointing Algorithm Based Unreliable Non-FIFO Channels", IJEM, vol.2, no.4, pp.35-48, 2012. DOI: 10.5815/ijem.2012.04.05

Reference

[1]B. Lampson and H. Sturgis, "Crash recovery in a distributed storage system," Xerox Palo Alto Research Center, Tech. Rep., Apr. 1979.

[2]Elnozahy, E. N., D. B. Johnson and W. Zwaenepoel, “The Performance of Consistent Checkpointing,” Proc. 11th Symp. Reliable Distributed Systems, pp. 86-95, Oct. 1992.

[3]Koo, R. and S. Toueg, “Checkpointing and Rollback-Recovery for Distributed Systems,” IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.

[4]Silva, L.M. and J.G. Silva, “Global Checkpointing for Distributed Programs,” Proc. 11th Symp. Reliable Distributed Systems, pp. 155-162, Oct. 1992.

[5]K. Bhatia, K. Marzullo and L. Alvisi. “The relative overhead of piggybacking in causal message logging protocols,” In Proceedings of the Seventeenth Symposium on Reliable Distributed Systems, pp. 348—353, 1998.

[6]R. Netzer and J. Xu, “Necessary and Sufficient Conditions for Consistent Global Snapshots,” IEEE Trans. Parallel and Distributed Systems, pp. 165-169, Feb. 1995.

[7]K. Chandy and L. Lamport. “Distributed snapshots: Determining global states of distributed systems,”ACM Trans. Comput. Systems, vol. 3, no. 1, pages 63-75, February 1985.

[8]Leslie Lamport. “Time, Clocks, and the Ordering of Events in a Distributed System,” ACM Communications,21(7),558-565, July 1978.

[9]Mattern F. “Virtual time and global states of Distributed Systems,” Proc.of Parallel and Distributed Alogrithms Conf. 215-254,1988.

[10]Shengfa Gao,Xin Li, Ruihua Zhang. “The Extended Finite State Machine and Fault Tolerant Mechanism in Distributed Systems,” 2009 Seventh ACIS International Conference on Software Engineering Research, Management and Applications.33-38, 2009.

International Journal of Engineering and Manufacturing (IJEM)