FER-PCVT: Facial Expression Recognition with Patch-Convolutional Vision Transformer for Stroke Patients

Yiming Fan; Hewei Wang; Xiaoyu Zhu; Xiangming Cao; Chuanjian Yi; Yao Chen; Jie Jia; Xiaofeng Lu

doi:10.3390/brainsci12121626

Brain Sci. 2022 Nov 28;12(12):1626. doi: 10.3390/brainsci12121626.

Authors

Yiming Fan¹, Hewei Wang², Xiaoyu Zhu¹, Xiangming Cao³, Chuanjian Yi⁴, Yao Chen⁵, Jie Jia², Xiaofeng Lu¹,⁶

Affiliations

¹ School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China.
² Department of Rehabilitation, Huashan Hospital, Fudan University, Shanghai 200040, China.
³ Department of Oncology, Jiangyin People's Hospital Affiliated to Nantong University, Wuxi 214400, China.
⁴ Department of Rehabilitation, The Affiliated Hospital of Qingdao University, Qingdao 266000, China.
⁵ Department of Rehabilitation, Shanghai Third Rehabilitation Hospital, Shanghai 200436, China.
⁶ Wenzhou Institute, Shanghai University, Wenzhou 325000, China.

Abstract

Early rehabilitation at the right intensity contributes to the physical recovery of stroke survivors. In clinical practice, physicians judge whether the training intensity is suitable for rehabilitation based on patients' narratives, training scores, and evaluation scales, which puts tremendous pressure on medical resources. In this study, a lightweight facial expression recognition algorithm is proposed to assess stroke patients' training motivation automatically. First, the properties of convolution are introduced into the Vision Transformer's structure, allowing the model to extract both local and global features of facial expressions. Second, the pyramid-shaped feature output mode of Convolutional Neural Networks is introduced to significantly reduce the model's parameters and computational cost. Moreover, a classifier better suited to the facial expressions of stroke patients is designed to improve performance further. We verified the proposed algorithm on the Real-world Affective Faces Database (RAF-DB), the Face Expression Recognition Plus Dataset (FER+), and a private dataset of stroke patients. Experiments show that the backbone network of the proposed algorithm achieves better performance than Pyramid Vision Transformer (PvT) and Convolutional Vision Transformer (CvT) with fewer parameters and fewer floating-point operations (FLOPs). In addition, the algorithm reaches 89.44% accuracy on the RAF-DB dataset, which is higher than the accuracy reported in other recent studies. In particular, it obtains an accuracy of 99.81% on the private dataset with only 4.10M parameters.

Keywords: convolutional neural networks (CNNs); facial expression recognition (FER); rehabilitation; stroke; vision transformer (ViT).
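The pyramid-shaped feature output mentioned in the abstract can be illustrated with a small arithmetic sketch. It is not the authors' implementation: the stage settings (kernel, stride, padding, channel width) and the 224-pixel input are hypothetical values chosen only to show how strided convolutional patch embedding shrinks the token count from stage to stage, and why that lowers the cost of self-attention, which grows quadratically with the number of tokens.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_cost(num_tokens, dim):
    """Rough multiply-accumulate count for the QK^T and AV products
    of one self-attention layer (projections are ignored)."""
    return 2 * num_tokens * num_tokens * dim


# Hypothetical stages as (kernel, stride, padding, channels).
# These values are assumptions for illustration, not taken from the paper.
PYRAMID_STAGES = [(7, 4, 2, 64), (3, 2, 1, 128), (3, 2, 1, 256)]


def pyramid_token_counts(image_size, stages):
    """Return (num_tokens, channels) for each stage of a pyramid backbone."""
    size = image_size
    counts = []
    for kernel, stride, padding, dim in stages:
        # Each stage re-embeds the feature map with a strided convolution,
        # so the spatial grid (and thus the token count) shrinks.
        size = conv_output_size(size, kernel, stride, padding)
        counts.append((size * size, dim))
    return counts


if __name__ == "__main__":
    for stage, (tokens, dim) in enumerate(pyramid_token_counts(224, PYRAMID_STAGES), 1):
        print(f"stage {stage}: {tokens} tokens, dim {dim}, "
              f"attention cost ~{attention_cost(tokens, dim):,}")
```

Under these assumed settings the grid shrinks from 56×56 to 28×28 to 14×14, so each stage processes a quarter of the previous stage's tokens while the channel width grows.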
