IJISA Vol. 14, No. 1, 8 Feb. 2022
Full Text (PDF, 635KB), PP.42-56
Data quality, IBM, artificial intelligence
Every domain now produces a huge amount of data. When automation is applied to a dataset, appropriately prepared data therefore plays an important role in achieving efficient and accurate results. According to data researchers, data scientists spend 80% of their time preparing and organizing data. To reduce this tedious work, IBM Research has developed the Data Quality for AI tool, which offers a variety of metrics that can be applied to different datasets (in .csv format) to assess the quality of the data. In this paper, we show how the IBM API toolkit can be used with different kinds of datasets and present the results for each metric in graphical form. To help readers understand the workflow of the IBM data purifier tool, we also present the entire flow of using the IBM Data Quality for AI toolkit in the form of an architecture.
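As a rough illustration of what a data quality metric over a .csv dataset can look like, the sketch below computes two simple scores, completeness and duplicate-row ratio, using only the Python standard library. It does not call the IBM Data Quality for AI API and does not reproduce its metric definitions; the function names, the two metrics chosen, and the sample rows are all hypothetical and serve only to make the idea concrete.

```python
import csv
import io


def completeness(rows, columns):
    """Fraction of cells that are non-empty after stripping whitespace."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(
        1 for row in rows for col in columns if (row.get(col) or "").strip()
    )
    return filled / total


def duplicate_ratio(rows, columns):
    """Fraction of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(rows)


def quality_report(csv_text):
    """Parse CSV text (first line is the header) and score it on both metrics."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "rows": len(rows),
        "completeness": completeness(rows, columns),
        "duplicate_ratio": duplicate_ratio(rows, columns),
    }


# Hypothetical sample: two empty cells and one repeated row.
SAMPLE = "id,weight,species\n1,242.0,Bream\n2,,Bream\n1,242.0,Bream\n3,290.0,\n"
report = quality_report(SAMPLE)
print(report)
```

Both scores lie between 0 and 1, so they can be plotted side by side for several datasets, which is the kind of per-metric graphical comparison the paper describes.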
Ankur Jariwala, Aayushi Chaudhari, Chintan Bhatt, Dac-Nhuong Le, "Data Quality for AI Tool: Exploratory Data Analysis on IBM API", International Journal of Intelligent Systems and Applications (IJISA), Vol. 14, No. 1, pp. 42-56, 2022. DOI: 10.5815/ijisa.2022.01.04