Hive-Based Data Encryption for Securing Sensitive Data in HDFS

Author(s)

Shivani Awasthi 1,* Narendra Kohli 1

1. Department of Computer Science, HBTU, Kanpur, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijmsc.2024.04.04

Received: 22 Jul. 2024 / Revised: 24 Aug. 2024 / Accepted: 8 Oct. 2024 / Published: 8 Dec. 2024

Index Terms

AES, Avro, Deflate, ORC, Parquet, Snappy, Gzip, HDFS, cloud environment

Abstract

Big Data technologies give businesses more insight into their massive data sets, allowing them to make better decisions and serve customers better. Because they aggregate so much data, big data systems are also an attractive target for attackers. Hadoop handles large data sets by running application programs that read and write data across a distributed system, and the Hadoop Distributed File System (HDFS) stores that data. Since HDFS does not safeguard data privacy, encrypting files is the appropriate way to protect data stored in HDFS, but it takes a long time. In this paper, to address these privacy concerns, we combine compressed data storage file formats with a proposed user-defined function (UDF), an XOR one-time pad combined with AES, to secure data in HDFS. This provides two levels of security by masking both selected data and the whole data in the file. Our experiments show that the total processing time is significantly shorter than that of a conventional method. The proposed UDF with the ORC file format and Zlib compression performs 9-10% better than 2DES and other methods. Finally, we decreased the load time of secure data and significantly improved query processing time with the Hive engine.
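
The abstract does not spell out the internals of the proposed UDF, so the following is only an illustrative sketch of the general idea of layering an XOR one-time pad under AES. It assumes a per-value random pad of the same length as the field and AES in GCM mode with a 128-bit key; the class and method names (XorAesSketch, xor, aesEncrypt, aesDecrypt) are hypothetical and are not taken from the paper. In a Hive deployment, logic of this kind would be wrapped in a UDF class, which is omitted here to keep the example free of external dependencies.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical sketch: XOR one-time-pad masking followed by AES encryption.
// The cipher mode (GCM) and key size are assumptions, not the paper's settings.
public class XorAesSketch {
    private static final SecureRandom RNG = new SecureRandom();
    private static final int IV_LEN = 12;    // IV length in bytes for GCM
    private static final int TAG_BITS = 128; // authentication tag length

    static byte[] randomBytes(int n) {
        byte[] b = new byte[n];
        RNG.nextBytes(b);
        return b;
    }

    // XOR every data byte with the pad byte at the same index.
    // The pad must be exactly as long as the data; applying it twice restores the input.
    static byte[] xor(byte[] data, byte[] pad) {
        if (data.length != pad.length) {
            throw new IllegalArgumentException("pad length must equal data length");
        }
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ pad[i]);
        }
        return out;
    }

    // Encrypt with AES-GCM; the result is the random IV followed by ciphertext and tag.
    static byte[] aesEncrypt(SecretKey key, byte[] plain) throws Exception {
        byte[] iv = randomBytes(IV_LEN);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = cipher.doFinal(plain);
        byte[] out = new byte[IV_LEN + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_LEN);
        System.arraycopy(ct, 0, out, IV_LEN, ct.length);
        return out;
    }

    // Reverse of aesEncrypt: read the IV from the first bytes, then decrypt the rest.
    static byte[] aesDecrypt(SecretKey key, byte[] blob) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, blob, 0, IV_LEN));
        return cipher.doFinal(blob, IV_LEN, blob.length - IV_LEN);
    }

    static SecretKey newAesKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        return kg.generateKey();
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newAesKey();
        byte[] field = "4111-1111-1111-1111".getBytes(StandardCharsets.UTF_8); // made-up sensitive value
        byte[] pad = randomBytes(field.length);  // fresh pad per value, never reused

        byte[] masked = xor(field, pad);          // level 1: mask the selected field
        byte[] sealed = aesEncrypt(key, masked);  // level 2: encrypt the masked bytes

        byte[] recovered = xor(aesDecrypt(key, sealed), pad);
        System.out.println("round trip ok: " + Arrays.equals(field, recovered));
    }
}
```

Recovering the original value requires both the AES key and the pad, which is one way to read the "dual level of security" described above; how the paper actually generates and stores its pad and key is not stated in the abstract.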

Cite This Paper

Shivani Awasthi, Narendra Kohli, "Hive-Based Data Encryption for Securing Sensitive Data in HDFS", International Journal of Mathematical Sciences and Computing (IJMSC), Vol.10, No.4, pp. 34-50, 2024. DOI: 10.5815/ijmsc.2024.04.04
