Juan Li; Souvik Sen; Nazia Zaman

Entity Extraction from Business Emails

Author(s)

Juan Li ¹,*, Souvik Sen ¹, Nazia Zaman ¹

1. North Dakota State University, Computer Science Department, Fargo, 58078, USA

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2015.09.03

Received: 7 Oct. 2014 / Revised: 23 Feb. 2015 / Accepted: 11 Apr. 2015 / Published: 8 Aug. 2015

Index Terms

Email, entity extraction, natural language processing

Abstract

Email still plays an important role in today's business communication thanks to its simplicity, flexibility, low cost, and compatibility with diverse types of information. However, processing the large volume of email that a business receives consumes tremendous time and labor. To quickly decipher incoming emails and locate the business-related information in them, a computerized solution is required. In this paper, we propose a comprehensive mechanism to extract important information from emails. The proposed solution integrates semantic web technology with natural language processing and information retrieval. It enables automatic extraction of important entities from an email and makes batch processing of business emails efficient. The proposed mechanism has been used at a transportation company.

Cite This Paper

Juan Li, Souvik Sen, Nazia Zaman, "Entity Extraction from Business Emails", International Journal of Information Technology and Computer Science (IJITCS), vol. 7, no. 9, pp. 15-22, 2015. DOI: 10.5815/ijitcs.2015.09.03

International Journal of Information Technology and Computer Science (IJITCS)