A Compression & Encryption Algorithm on DNA Sequences Using Dynamic Look up Table and Modified Huffman Techniques

Full Text (PDF, 788KB), PP.39-61

Views: 0 Downloads: 0

Author(s)

Syed Mahamud Hossein 1,* S.Roy 1

1. Regional Office, Kolaghat; Directorate of Vocational Education & Training, West Bengal, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2013.10.05

Received: 20 Dec. 2012 / Revised: 6 Apr. 2013 / Accepted: 23 Jun. 2013 / Published: 8 Sep. 2013

Index Terms

Compression, Security

Abstract

Storing, transmitting and security of DNA sequences are well known research challenge. The problem has got magnified with increasing discovery and availability of DNA sequences. We have represent DNA sequence compression algorithm based on Dynamic Look Up Table (DLUT) and modified Huffman technique. DLUT consists of 43(64) bases that are 64 sub-stings, each sub-string is of 3 bases long. Each sub-string are individually coded by single ASCII code from 33(!) to 96(`) and vice versa. Encode depends on encryption key choose by user from four base pair {a,t.g and c}and decode also require decryption key provide by the encoded user. Decoding must require authenticate input for encode the data. The sub-strings are combined into a Dynamic Look up Table based pre-coding routine. This algorithm is tested on reverse; complement & reverse complement the DNA sequences and also test on artificial DNA sequences of equivalent length. Speed of encryption and security levels are two important measurements for evaluating any encryption system. Due to proliferate of ubiquitous computing system, where digital contents are accessible through resource constraint biological database security concern is very important issue. A lot of research has been made to find an encryption system which can be run effectively in those biological databases. Information security is the most challenging question to protect the data from unauthorized user. The proposed method may protect the data from hackers. It can provide the three tier security, in tier one is ASCII code, in tier two is nucleotide (a,t,g and c) choice by user and tier three is change of label or change of node position in Huffman Tree. Compression of the genome sequences will help to increase the efficiency of their use. The greatest advantage of this algorithm is fast execution, small memory occupation and easy implementation. Since the program to implement the technique have been written originally in the C language, (Windows XP platform, and TC compiler) it is possible to run in other microcomputers with small changes (depending on platform and Compiler used). The execution is quite fast, all the operations are carried out in fraction of seconds, depending on the required task and on the sequence length. The technique can approach an effective compression ratio of 1.98 bits/base and even lower. When a user searches for any sequence for an organism, an encrypted compressed sequence file can be sent from the data source to the user. The encrypted compressed file then can be decrypted & decompressed at the client end resulting in reduced transmission time over the Internet. An encrypt compression algorithm that provides a moderately high compression with encryption rate with minimal decryption with decompression time.

Cite This Paper

Syed Mahamud Hossein, S.Roy, "A Compression & Encryption Algorithm on DNA Sequences Using Dynamic Look up Table and Modified Huffman Techniques", International Journal of Information Technology and Computer Science(IJITCS), vol.5, no.10, pp.39-61, 2013. DOI:10.5815/ijitcs.2013.10.05

Reference

[1]M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed. New York: Springer-Verlag, 1997.

[2]R. Curnow and T. Kirkwood, “Statistical analysis of deoxyribonucleic acid sequence data-a review,” J. Royal Statistical Soc., vol. 152, pp. 199-220, 1989.

[3]S. Grumbach and F. Tahi, “A new challenge for compression algorithms: Genetic sequences,” J. Inform. Process. Manage., vol. 30, no. 6, pp. 875-866, 1994.

[4]É. Rivals, O. Delgrange, J.P. Delahaye, M.Dauchet, M.O. Delorme et al., “Detection of significant patterns by compression algorithms: the case of Approximate Tandem Repeats inDNAsequences,” CABIOS, vol. 13, no. 2, pp. 131-136,1997.

[5]K. Lanctot, M. Li, and E.H. Yang, “Estimating DNA sequence entropy,”in Proc. SODA 2000, to be published.

[6]D. Loewenstern and P. Yianilos, “Significantly lower entropy estimates for natural DNA sequences,” J. Comput. Biol., to be published (Preliminary version appeared in a DIMACS workshop, 1996.)

[7]Two algorithms for constructing efficient huffman-code based reversible variable length Codes Chia-Wei Lin; Ja-Ling Wu; Yuh-Jue Chuang

[8]Bentley J. L., Sleator D.D., Tarjan R.E., and Wei V., "A locally adaptive data compression scheme", Communications of the ACM, 29(4), 320-330, 1986.

[9]J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Trans. Comm., COM-32(4):396–402, April 1984.

[10]C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, 1948.

[11]D. A. Huffman, “A method for the construction of minimum-redundancy codes,“Proc. IRE, vol. 40, pp. 1098-1101,1952.

[12]On the competitive optimality of Huffman codes by Thomas. M. Cover.

[13]Guaranteed Synchronization of Huffman Codes with Known Position of Decoder Marek Tomasz Biskup, Wojciech Plandowski,

[14]C. E. Shannon, “Communication theory of secrecy systems,” Bell Systems Technical Journal, v. 28, October 1949, pp. 656-715. 

[15]Chen, L., Lu, S. and Ram J. 2004. “Compressed Pattern Matching in DNA Sequences”. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB 2004)

[16]Toshiko Matsumoto, Kunihiko Sadakane and Hiroshi Imai. “ Biological Sequence Compression Algorithms”, Genome Informatics 11 : 43-52 (2000)

[17]S. Grumbach and F. Tahi, “A new challenge for compression algorithms: Genetic sequences,” J. Inform. Process. Manage., vol. 30, no. 6, pp. 875-866, 1994.

[18]Xin Chen, San Kwong and Mine Li, “A Compression Algorithm for DNA Sequences Using Approximate Matching for Better Compression Ratio to Reveal the True Characteristics of DNA”, IEEE Engineering in Medicine and Biology,pp 61-66,July/August 2001.

[19]Adam Drozdek “ Elements of Data Compression”, Vikas Publishing House (2002)

[20]T. Matsumoto,K.Sadakame and H. Imani, ”Biological sequence compression algorithm”, Genome Informatics 11:43-52 (2000).

[21]ASCII code. [Online]. Available: http://www.asciitable.com

[22]National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov