Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?

Oumaima Moutik; Hiba Sekkat; Smail Tigani; Abdellah Chehri; Rachid Saadane; Taha Ait Tchakoucht; Anand Paul

doi:10.3390/s23020734

Sensors (Basel). 2023 Jan 9;23(2):734. doi: 10.3390/s23020734.

Authors

Oumaima Moutik¹, Hiba Sekkat¹, Smail Tigani¹, Abdellah Chehri², Rachid Saadane³, Taha Ait Tchakoucht¹, Anand Paul⁴

Affiliations

¹ Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco.
² Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON 11 K7K 7B4, Canada.
³ SIRC-LaGeS, Hassania School of Public Works, Casablanca 8108, Morocco.
⁴ School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Republic of Korea.

Abstract

Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the last decades. Convolutional neural networks (CNNs) are a significant component of this topic and have played a crucial role in the prominence of deep learning. Inspired by the human visual system, CNNs have been applied to visual data and have solved challenges across a range of computer vision and video/image analysis tasks, including action recognition (AR). More recently, following the success of the transformer in natural language processing (NLP), transformers have begun to set new trends in vision tasks, which has prompted a discussion about whether Vision Transformer (ViT) models will replace CNNs for action recognition in video clips. This paper examines this topic in detail: it studies CNNs and Transformers for action recognition separately and then compares them in terms of the accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, it discusses whether CNNs or Vision Transformers will win the race.

Keywords: action recognition; action recognitions; conversational systems; convolutional neural networks; natural language understanding; recurrent neural networks; vision transformers.

Publication types

Review

MeSH terms

Computers
Humans
Image Processing, Computer-Assisted / methods
Neural Networks, Computer*
Recognition, Psychology
Vision, Ocular*

Grants and funding

This research received no external funding.