
Proceedings of the Southwest State University


A Method for Tracking the Processes of User Interaction with Objects in Video Sequences

https://doi.org/10.21869/2223-1560-2021-25-4-177-200

Abstract

Purpose of research. In the development of cyber-physical systems and intelligent environments designed to analyze user activity, tracking user interactions with objects is a topical task.

Methods. To detect and track user interactions with objects in video sequences, a method was developed based on the combined application of neural network models for object detection and user segmentation, together with the construction of depth maps from the frames of the video sequence. The paper presents the corresponding algorithms and algorithmic models. The testing and quality assessment of the developed method were carried out on a test data set of 1,000 video sequences, each up to 20 seconds long.
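The abstract only names the components the method combines (object detection, user segmentation, per-frame depth estimation); the actual algorithms are given in the full text. The following Python sketch is a hypothetical illustration of how such a per-frame fusion could look: the model callables, thresholds, and the overlap-plus-depth test are assumptions made for illustration, not the author's published algorithm.

```python
import numpy as np

def detect_interactions(frame, detect_objects, segment_user, estimate_depth,
                        overlap_thr=0.1, depth_thr=0.15):
    """Hypothetical per-frame fusion: flag objects the user appears to touch."""
    boxes = detect_objects(frame)      # e.g. [(x0, y0, x1, y1, class_id), ...]
    user_mask = segment_user(frame)    # HxW boolean mask of the user
    depth = estimate_depth(frame)      # HxW relative depth map

    interactions = []
    for (x0, y0, x1, y1, cls) in boxes:
        obj_region = np.zeros_like(user_mask, dtype=bool)
        obj_region[y0:y1, x0:x1] = True

        overlap = obj_region & user_mask
        if overlap.sum() / max(obj_region.sum(), 1) < overlap_thr:
            continue  # no 2-D contact between the user mask and the object box

        # Require the user and the object to lie at a similar depth, so that
        # mere 2-D occlusion is not reported as an interaction.
        if abs(np.median(depth[overlap]) - np.median(depth[obj_region])) < depth_thr:
            interactions.append(cls)
    return interactions
```

Tracking an interaction over time then amounts to linking such per-frame detections of the same object across consecutive frames, for example with a simple IoU-based association step.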

Results. In the course of the experimental studies, the quality indicators (accuracy, recall, precision) of interaction detection were determined for video sequences with illumination levels of 100% and 50%, amounting to {0.82, 0.78, 0.76} and {0.70, 0.59, 0.70}, respectively, while the average proportions of correctly tracked interactions for these sets of video sequences were 81% and 71%. According to the results of the testing, the developed method makes it possible to detect and track user interactions with objects in real time, including under insufficient scene illumination.
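For reference, the three reported indicators follow directly from confusion-matrix counts over interaction detections. The sketch below uses hypothetical counts chosen only so that the resulting values roughly match the reported 100%-illumination figures {0.82, 0.78, 0.76}; the actual counts are not given in the abstract.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)       # share of true interactions that were detected
    precision = tp / (tp + fp)    # share of reported interactions that were real
    return accuracy, recall, precision

# Hypothetical counts that roughly reproduce (0.82, 0.78, 0.76)
print(classification_metrics(tp=70, fp=22, fn=20, tn=121))
```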

Conclusion. Based on the results of testing the proposed method for tracking user interactions with objects on a test set of 1,000 video sequences, the proposed solution showed fairly high quality of detecting and tracking interactions in video sequences with illumination levels of 100% and 50%. Thus, the developed method is, to a certain extent, robust to changes in scene illumination and successfully solves the problem of detecting and tracking user interaction with various classes of objects in a video sequence without requiring specialized equipment.

About the Author

R. N. Iakovlev
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS); St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences
Russian Federation

Roman N. Iakovlev, Junior Researcher, Laboratory of Big Data Technologies in Socio-Cyberphysical Systems

39, 14th Line, St. Petersburg 199178 



References

1. Krinski B.A., Ruiz D.V., Machado G.Z., Todt E. Masking salient object detection, a mask region-based convolutional neural network analysis for segmentation of salient objects. Latin American Robotics Symposium (LARS), 2019 Brazilian Symposium on Robotics (SBR) and 2019 Workshop on Robotics in Education (WRE), IEEE, 2019, pp. 55-60.

2. He K., Girshick R., Dollár P. Rethinking ImageNet pre-training. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4918-4927.

3. COCO: Common Objects in Context [Electronic resource]. Detection Leaderboard [site]. Available at: https://cocodataset.org/#detection-leaderboard.

4. Lin T.Y. et al. Microsoft COCO: Common Objects in Context. European conference on computer vision, 2014, pp. 740-755.

5. Du X., Lin T.Y., Jin P., Ghiasi G., Tan M., Cui Y., Song X. SpineNet: Learning scale-permuted backbone for recognition and localization. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11592-11601.

6. Gkioxari G., Girshick R., Dollár P., He K. Detecting and recognizing human-object interactions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8359-8367.

7. Shen L., Yeung S., Hoffman J., Mori G., Fei-Fei L. Scaling human-object interaction recognition through zero-shot learning. IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 1568-1576.

8. Cao Z., Hidalgo G., Simon T., Wei S. E., Sheikh Y. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. IEEE transactions on pattern analysis and machine intelligence, 2019, vol. 43, no. 1, pp. 172-186.

9. Rogez G., Khademi M., Supancic J. III, Montiel J.M.M., Ramanan D. 3d hand pose detection in egocentric rgb-d images. European Conference on Computer Vision, Springer, Cham, 2014, pp. 356-371.

10. Keskin C., Kıraç F., Kara Y.E., Akarun L. Hand pose estimation and hand shape classification using multi-layered randomized decision forests. European Conference on Computer Vision. Springer. Berlin, Heidelberg, 2012, pp. 852-863.

11. Oberweger M., Wohlhart P., Lepetit V. Hands deep in deep learning for hand pose estimation. Proceedings of 20th Computer Vision Winter Workshop (CVWW), 2015, pp. 21-30. arXiv preprint arXiv:1502.06807.

12. Garcia-Hernando G., Yuan S., Baek S., Kim T. K. First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 409-419.

13. Moon G., Chang J.Y., Lee K.M. V2v-posenet: Voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. Proceedings of the IEEE conference on computer vision and pattern Recognition, 2018, pp. 5079-5088.

14. Redmon J., Farhadi A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

15. Murawski K., Murawska M., Pustelny T. Optimizing the light source design for a sensor to measure the stroke volume of the artificial heart. 13th Conference on Integrated Optics: Sensors, Sensing Structures, and Methods. International Society for Optics and Photonics, 2018, vol. 10830, p. 1083006.

16. Karsch K., Liu C., Kang S. B. Depth extraction from video using non-parametric sampling. European conference on computer vision. Springer, Berlin, Heidelberg, 2012, pp. 775-788.

17. Eigen D., Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. Proceedings of the IEEE international conference on computer vision, 2015, pp. 2650-2658.

18. Laina I., Rupprecht C., Belagiannis V., Tombari F., Navab N. Deeper depth prediction with fully convolutional residual networks. 2016 Fourth international conference on 3D vision (3DV). IEEE, 2016, pp. 239-248.

19. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.

20. Cheong Y.Z., Chew W.J. The Application of Image Processing to Solve Occlusion Issue in Object Tracking. MATEC Web of Conferences, EDP Sciences, 2018, vol. 152, p. 03001.

21. Ning G., Zhang Z., Huang C., Ren X., Wang H., Cai C., He Z. Spatially supervised recurrent convolutional neural networks for visual object tracking. 2017 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2017, pp. 1-4.

22. Feng Q., Ablavsky V., Bai Q., Li G., Sclaroff S. Real-time visual object tracking with natural language description. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 700-709.

23. Zhang D., Maei H., Wang X., Wang Y.F. Deep reinforcement learning for visual object tracking in videos. arXiv preprint arXiv:1701.08936. 2017.

24. Redmon J., Divvala S., Girshick R., Farhadi A. You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779-788.



For citations:


Iakovlev R.N. A Method for Tracking the Processes of User Interaction with Objects in Video Sequences. Proceedings of the Southwest State University. 2021;25(4):177-200. (In Russ.) https://doi.org/10.21869/2223-1560-2021-25-4-177-200



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2223-1560 (Print)
ISSN 2686-6757 (Online)