Evading obscure communication from spam emails

Khan Farhan Rafat; Qin Xin; Abdul Rehman Javed; Zunera Jalil; Rana Zeeshan Ahmad

doi:10.3934/mbe.2022091

Math Biosci Eng. 2022 Jan;19(2):1926-1943. doi: 10.3934/mbe.2022091. Epub 2021 Dec 22.

Authors

Khan Farhan Rafat¹, Qin Xin², Abdul Rehman Javed¹, Zunera Jalil¹, Rana Zeeshan Ahmad³

Affiliations

¹ Department of Cyber Security, Faculty of Computing and AI, Air University, PAF Complex, E-9, Islamabad, Pakistan.
² Faculty of Science and Technology, University of the Faroe Islands, Vestarabryggja 15, FO 100, Torshavn, Faroe Islands.
³ Department of Information Technology, University of Sialkot, Pakistan.

PMID: 35135236
DOI: 10.3934/mbe.2022091

Abstract

Spam is any form of annoying, unsolicited digital communication sent in bulk; it may contain offensive content that spreads viruses and cyber-attacks. The sharp increase in spam volume has necessitated more reliable and robust artificial intelligence-based anti-spam filters. Besides text, an email sometimes contains multimedia content such as audio, video, and images. However, text-centric spam filtering based on text classification techniques remains the preferred choice today. In this paper, we show that text pre-processing techniques nullify the detection of malicious content in an obscure communication framework. We used the SpamAssassin corpus with and without text pre-processing and examined it using machine learning (ML) and deep learning (DL) algorithms to classify emails as ham or spam. The proposed DL-based approach consistently outperforms the ML models. In the first stage, with pre-processing, the long short-term memory (LSTM) model achieves the best results: 93.46% precision, 96.81% recall, and a 95% F1-score. In the second stage, without pre-processing, LSTM again achieves the best results: 95.26% precision, 97.18% recall, and a 96% F1-score. The results show that DL algorithms outperform standard ML algorithms in filtering spam. However, both the ML and the DL algorithms perform unsatisfactorily at detecting encrypted communication.
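
The claim that text pre-processing can nullify detection of hidden content can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' method: the case-based encoding, the function names (hide_bits_in_case, recover_bits, preprocess), and the sample email text are all hypothetical. The sketch hides a bit string in the capitalization of words and then applies a typical cleanup step (lowercasing, punctuation removal, whitespace tokenization), after which the carrier email and the plain email yield identical tokens.

```python
import string


def hide_bits_in_case(cover, bits):
    """Toy covert channel (hypothetical): capitalize word i when bits[i] == '1'."""
    words = cover.lower().split()
    if len(bits) > len(words):
        raise ValueError("cover text too short for payload")
    out = []
    for i, word in enumerate(words):
        if i < len(bits) and bits[i] == "1":
            out.append(word.capitalize())
        else:
            out.append(word)
    return " ".join(out)


def recover_bits(stego, n_bits):
    """Read the payload back from the capitalization of the first n_bits words."""
    words = stego.split()
    return "".join("1" if w[0].isupper() else "0" for w in words[:n_bits])


def preprocess(text):
    """Typical classifier cleanup: lowercase, drop punctuation, split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


# Hypothetical sample data, not taken from the SpamAssassin corpus.
cover = "limited offer claim your free prize today before it expires"
payload = "1011001"
stego = hide_bits_in_case(cover, payload)

# The payload survives in the raw text...
print(recover_bits(stego, len(payload)) == payload)
# ...but the pre-processed tokens of the carrier and the plain email are the same,
# so any model fed these tokens cannot tell the two apart.
print(preprocess(stego) == preprocess(cover))
```

Under these assumptions, a classifier trained on the cleaned tokens receives no signal from the hidden channel, which is one way to read the abstract's point that pre-processing removes the evidence a filter would need.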

Keywords: Email classification; ham; machine learning; spam; steganography; text pre-processing.

MeSH terms

Algorithms
Artificial Intelligence*
Communication
Electronic Mail*
Machine Learning