Work place: LITIO laboratory, University of Oran, BP 1524, El-M'Naouer, 31000, Oran, Algeria
E-mail: saidi.imene@univ-oran.dz
Research Interests: Computer Architecture and Organization, Information Systems, Data Mining, Information Retrieval, Data Structures and Algorithms
Biography
Saidi Imene: PhD candidate in computer science at the LITIO Laboratory, University of Oran Es Senia, Algeria. She received a five-year engineering degree in computer science in 2010 and a Master's degree in 2011. Her research interests are data mining, information retrieval, and web technology.
By Saidi Imene, Nait Bahloul Safia
DOI: https://doi.org/10.5815/ijitcs.2014.09.07, Pub. Date: 8 Aug. 2014
Web information sources such as forums, blogs, and news articles are becoming increasingly large and diverse. Although advances in technology are improving techniques for handling the large amounts of generated data, these sources are heterogeneous in structure (semi-structured or unstructured) and in nature (text or images). Software solutions are therefore needed to prepare the data and to access these sources in a homogeneous way. In this paper we present an approach for indexing heterogeneous data sources. Our objective is to offer techniques for efficient indexing of web sources by storing only the necessary information. We propose automatic indexing for semi-structured or unstructured sources (e.g., XML files, HTML files) and annotation for other sources (e.g., images and videos embedded in a page). We present our indexing algorithms and propose using the MapReduce model to build a scalable inverted index. Experiments on a real-world corpus show that our approach achieves good performance.
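The abstract does not reproduce the authors' algorithms, so the following is only a minimal, single-process sketch of the general idea of building an inverted index in the MapReduce style. The function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`), the simple tokenizer, and the sample documents are all hypothetical and are not taken from the paper.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    # Assumption: a naive lowercase alphanumeric tokenizer, for illustration only.
    terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [(term, doc_id) for term in terms]


def shuffle(pairs):
    """Shuffle step: group document ids by term, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: turn the grouped ids into a sorted, duplicate-free posting list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run map, shuffle, and reduce in memory over a {doc_id: text} mapping."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    grouped = shuffle(pairs)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical sample corpus.
docs = {
    "d1": "Web forums and blogs",
    "d2": "Blogs and news articles",
}
index = build_inverted_index(docs)
```

In a real MapReduce deployment the map and reduce functions would run in parallel across machines and the framework would perform the shuffle; this sketch only shows how the three steps fit together to produce term-to-document posting lists.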