A Study and Performance Comparison of MapReduce and Apache Spark on Twitter Data on Hadoop Cluster

Pages: 61-70

Author(s)

Md. Nowraj Farhan 1,*, Md. Ahsan Habib 2, Md. Arshad Ali 2

1. Department of Computer Science & Engineering, University of Liberal Arts Bangladesh, Dhaka, 1209, Bangladesh

2. Faculty of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, 5200, Bangladesh

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2018.07.07

Received: 17 Mar. 2018 / Revised: 11 May 2018 / Accepted: 20 Jun. 2018 / Published: 8 Jul. 2018

Index Terms

Big data, Hadoop, Java Virtual Machine (JVM), MapReduce, Supervised Learning, Apache Spark

Abstract

We explore Apache Spark, the newest tool for analyzing big data, which lets programmers perform in-memory computation on large data sets in a fault-tolerant manner. MapReduce is a high-performance distributed big data programming framework that is preferred by most big data analysts and has been available for a long time with very good documentation. The purpose of this project was to assess the scalability of open-source distributed data management systems such as Apache Hadoop on small and medium data sets, and to compare its performance against Apache Spark, a scalable distributed in-memory data processing engine. For this comparison, experiments were run on data sets ranging in size from 5GB to 43GB, on both a single machine and a Hadoop cluster. The results show that the cluster outperforms a single machine by a wide margin. Apache Spark outperforms MapReduce by a dramatic margin, and as the data grows Spark becomes more reliable and fault tolerant. We also found that increasing the number of blocks on the Hadoop Distributed File System increases the run time of both the MapReduce and Spark programs; even in this case, Spark performs far better than MapReduce. This demonstrates that Spark is a possible replacement for MapReduce in the near future.

Cite This Paper

Nowraj Farhan, Ahsan Habib, Arshad Ali, "A Study and Performance Comparison of MapReduce and Apache Spark on Twitter Data on Hadoop Cluster", International Journal of Information Technology and Computer Science (IJITCS), Vol.10, No.7, pp.61-70, 2018. DOI: 10.5815/ijitcs.2018.07.07
