Category Specific Prediction Modules for Visual Relation Recognition


Author(s)

Sohan Chowdhury 1, Tanbirul Hashan 1, Afif Abdur Rahman 1, A. F. M. Saifuddin Saif 1

1. Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh

* Corresponding author.

DOI: https://doi.org/10.5815/ijmsc.2019.02.02

Received: 18 Nov. 2018 / Revised: 10 Jan. 2019 / Accepted: 15 Feb. 2019 / Published: 8 Apr. 2019

Index Terms

Visual Relation Recognition, Deep Learning, Computer Vision

Abstract

Object classification alone does not provide a complete understanding of the information contained in an image. Visual relation information such as “person playing with dog” conveys substantially more than the bare labels “person, dog”: the inter-relations of objects offer the insight needed to truly understand the complete scene. Due to the complex nature of such combinations, conventional computer vision techniques have not shown significant promise, and monolithic approaches lack precision and accuracy because the space of possible relation combinations is vast. Solving this problem is crucial to the development of advanced computer vision applications across every sector of the modern world. We propose a model that combines recent advances in Convolutional Neural Networks (deep learning) with a divide-and-conquer approach to relation detection. Possible relations are first broken down into categories such as spatial (left, right) and vehicle-related (riding, driving). The task is then divided into segmenting the objects, estimating the likely relationship category, and performing recognition with a module built specifically for that category. Each module can be trained on a significantly smaller dataset with less computation. Additionally, this approach yields recall rates comparable to state-of-the-art research while remaining precise and accurate for the specific relation categories.
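The divide-and-conquer pipeline described in the abstract can be sketched in a few lines: detect objects, route each subject-object pair to an estimated relation category, then let a category-specific module predict the predicate. The following is a minimal illustrative sketch, not the authors' implementation; the module rules, the `estimate_category` router, and all names here are hypothetical stand-ins for learned components.

```python
# Hypothetical sketch of category-specific relation prediction.
# Real systems would replace each hand-written rule with a trained network.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class DetectedObject:
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) bounding box


def spatial_module(subj: DetectedObject, obj: DetectedObject) -> str:
    # Illustrative spatial rule: compare horizontal box centers.
    cs = (subj.box[0] + subj.box[2]) / 2
    co = (obj.box[0] + obj.box[2]) / 2
    return "left of" if cs < co else "right of"


def vehicle_module(subj: DetectedObject, obj: DetectedObject) -> str:
    # Illustrative vehicle-related rule.
    return "riding" if subj.label == "person" else "near"


# Each relation category gets its own prediction module.
MODULES: Dict[str, Callable[[DetectedObject, DetectedObject], str]] = {
    "spatial": spatial_module,
    "vehicle": vehicle_module,
}


def estimate_category(subj: DetectedObject, obj: DetectedObject) -> str:
    # Placeholder router; the paper proposes estimating this from the image.
    vehicles = {"bike", "car", "horse"}
    return "vehicle" if obj.label in vehicles else "spatial"


def predict_relation(subj: DetectedObject, obj: DetectedObject) -> str:
    category = estimate_category(subj, obj)
    predicate = MODULES[category](subj, obj)
    return f"{subj.label} {predicate} {obj.label}"


person = DetectedObject("person", (10, 10, 50, 100))
bike = DetectedObject("bike", (30, 60, 90, 110))
dog = DetectedObject("dog", (120, 40, 160, 90))
print(predict_relation(person, bike))  # person riding bike
print(predict_relation(person, dog))   # person left of dog
```

Because each module only has to discriminate among the predicates of its own category, the per-module label space (and thus the required training data) stays small, which is the core argument of the paper.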

Cite This Paper

Sohan Chowdhury, Tanbirul Hashan, Afif Abdur Rahman, A. F. M. Saifuddin Saif, "Category Specific Prediction Modules for Visual Relation Recognition", International Journal of Mathematical Sciences and Computing (IJMSC), Vol. 5, No. 2, pp. 19-29, 2019. DOI: 10.5815/ijmsc.2019.02.02
