Using cascade CNN-LSTM-FCNs to identify AI-altered video based on eye state sequence

Muhammad Salihin Saealal; Mohd Zamri Ibrahim; David J Mulvaney; Mohd Ibrahim Shapiai; Norasyikin Fadilah

doi:10.1371/journal.pone.0278989

PLoS One. 2022 Dec 15;17(12):e0278989. doi: 10.1371/journal.pone.0278989. eCollection 2022.

Authors

Muhammad Salihin Saealal¹ ², Mohd Zamri Ibrahim¹, David J Mulvaney³, Mohd Ibrahim Shapiai⁴, Norasyikin Fadilah¹

Affiliations

¹ Faculty of Electrical and Electronics Engineering Technology, Universiti Malaysia Pahang, Pekan Campus, Pekan, Pahang, Malaysia.
² Electrical Engineering Technology Department, Faculty of Electric and Electronic Engineering Technology, Universiti Teknikal Malaysia Melaka, Durian Tunggal, Melaka, Malaysia.
³ School of Electronic, Electrical and Systems Engineering, Loughborough University, Loughborough, United Kingdom.
⁴ Centre for Artificial Intelligence and Robotics, Malaysia-Japan International Institute of Technology, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia.

Abstract

Deep learning has been notably successful in data analysis, computer vision, and human control. However, it has also enabled the creation of DeepFake video sequences and images whose alterations are not easily or explicitly detectable. Such alterations have recently been used to spread false news or disinformation. This study aims to identify DeepFake videos and images and to alert viewers to the possible falsity of the information. The work presents a novel means of revealing fake face videos by cascading a convolutional network with recurrent neural network and fully connected network (FCN) models. The detection approach uses the eye-blinking state across temporal video frames. It is nevertheless challenging to depict precisely, through this physiological signal, (i) the artificiality in fake videos and (ii) the spatial information within an individual frame. Spatial features were extracted using a VGG16 network trained on the ImageNet dataset. Temporal features were then extracted in every 20 sequences through an LSTM network. The pre-processed eye-blinking state served as a probability from which a novel BPD dataset was generated. This new dataset was used to train three models with four, three, and six hidden layers, respectively, each with a unique architecture and a specific dropout value. As a result, the model accurately identified tampered videos within the dataset. The model was assessed on the BPD dataset, which is based on one of the most complex datasets (FaceForensics++), and achieved 90.8% accuracy. This accuracy was maintained on datasets that were not used in the training process. The training process was also accelerated by lowering the computational requirements.
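The abstract mentions that temporal features are extracted "in every 20 sequences" from an eye-state signal. As a rough illustration only, the sketch below assumes this means grouping per-frame eye-state probabilities into non-overlapping windows of 20 frames before they reach a temporal model. The function name `make_windows`, the constant `WINDOW`, and the choice to drop an incomplete trailing window are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

# Assumed window length, read from "every 20 sequences" in the abstract.
WINDOW = 20


def make_windows(probs: Sequence[float], size: int = WINDOW) -> List[List[float]]:
    """Split per-frame eye-state probabilities into non-overlapping windows.

    Hypothetical helper: trailing frames that do not fill a whole
    window are dropped, which is an assumption, not the paper's rule.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    # Index of the first frame that would fall into an incomplete window.
    full = len(probs) - len(probs) % size
    return [list(probs[i:i + size]) for i in range(0, full, size)]


if __name__ == "__main__":
    # 45 synthetic probabilities (not real data) yield two full windows.
    synthetic = [0.9] * 45
    windows = make_windows(synthetic)
    print(len(windows), len(windows[0]))
```

Each resulting window could then be passed to a sequence model such as an LSTM; overlapping (sliding) windows would be an equally plausible reading of the abstract.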

Copyright: © 2022 Saealal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Humans
Neural Networks, Computer*
Probability

Grants and funding

This research was financially supported by the Fundamental Research Grant Scheme (FRGS/1/2021/ICT07/UMP/02/1), RDU number RDU210136, awarded by the Ministry of Higher Education (MOHE) via the Research and Innovation Department, Universiti Malaysia Pahang (UMP), Malaysia. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.