Delay Scheduling Based Replication Scheme for Hadoop Distributed File System

Pages: 73-78


Author(s)

S. Suresh 1,*, N.P. Gopalan 1

1. Department of Computer Applications, National Institute of Technology, Tiruchirappalli - 620015, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2015.04.08

Received: 3 Jul. 2014 / Revised: 20 Oct. 2014 / Accepted: 11 Jan. 2015 / Published: 8 Mar. 2015

Index Terms

Dynamic Replication, HDFS, Delay Scheduling, Hadoop MapReduce

Abstract

The volume of data generated and processed by modern computing systems is growing rapidly. MapReduce is an important programming model for large-scale, data-intensive applications. Hadoop is a popular open-source implementation of MapReduce and the Google File System (GFS). Its scalability and fault tolerance have made it a standard for big data processing. Hadoop stores data in the Hadoop Distributed File System (HDFS), which achieves data reliability and fault tolerance through replication. In this paper, a new technique called the Delay Scheduling Based Replication Algorithm (DSBRA) is proposed. It uses information collected from the scheduler to identify popular files/blocks in HDFS and replicate them, and to identify unpopular ones and dereplicate them. Experimental results show that the proposed method improves response time by 13% and locality by 7% over existing algorithms.
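The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of scheduler-driven replication, not the authors' DSBRA. It assumes that the scheduler can report, for each launched task, which file it read and whether it ran data-local. The class name `ReplicationAdvisor`, its methods, the window model, and all thresholds are hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ReplicationAdvisor:
    """Toy popularity tracker; every name and threshold here is hypothetical."""
    min_replicas: int = 3    # assumed floor (3 is the HDFS default replication factor)
    max_replicas: int = 6    # assumed ceiling
    miss_threshold: int = 2  # non-local launches per window that mark a file as popular
    replicas: dict = field(default_factory=dict)
    misses: Counter = field(default_factory=Counter)
    accesses: Counter = field(default_factory=Counter)

    def record_task(self, path: str, local: bool) -> None:
        """Record one task launch reported by the scheduler for the file at `path`."""
        self.replicas.setdefault(path, self.min_replicas)
        self.accesses[path] += 1
        if not local:
            # The task could not be placed next to its data.
            self.misses[path] += 1

    def end_window(self) -> dict:
        """Adjust target replica counts, reset the counters, and return the targets."""
        for path, current in self.replicas.items():
            if self.misses[path] >= self.miss_threshold:
                # Popular file with poor locality: add one replica, up to the ceiling.
                self.replicas[path] = min(current + 1, self.max_replicas)
            elif self.accesses[path] == 0:
                # Unused in this window: drop one replica, down to the floor.
                self.replicas[path] = max(current - 1, self.min_replicas)
        self.misses.clear()
        self.accesses.clear()
        return dict(self.replicas)
```

In this sketch, a file that sees at least `miss_threshold` non-local launches in one window gains a replica, and a file that goes a whole window without any access loses one, never falling below `min_replicas`. A real system would also have to weigh storage cost and replica placement, which the sketch ignores.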

Cite This Paper

S. Suresh, N.P. Gopalan, "Delay Scheduling Based Replication Scheme for Hadoop Distributed File System", International Journal of Information Technology and Computer Science (IJITCS), vol. 7, no. 4, pp. 73-78, 2015. DOI: 10.5815/ijitcs.2015.04.08
