Siamese-Based Architecture for Cross-Lingual Plagiarism Detection in English-Hindi Language Pairs

Basant Agarwal; Mukesh Kumar Gupta; Harish Sharma; Ramesh Chandra Poonia

doi:10.1089/big.2020.0243

Big Data. 2023 Feb;11(1):48-58. doi: 10.1089/big.2020.0243. Epub 2022 Oct 18.

Authors

Basant Agarwal¹, Mukesh Kumar Gupta², Harish Sharma³, Ramesh Chandra Poonia⁴

Affiliations

¹ Department of Computer Science and Engineering, Indian Institute of Information Technology Kota (IIIT Kota), Jaipur, Rajasthan, India.
² Department of Computer Science and Engineering, Swami Keshvanand Institute of Technology, Management and Gramothan, Jaipur, Rajasthan, India.
³ Department of Computer Science and Engineering, Rajasthan Technical University, Kota, Rajasthan, India.
⁴ Department of Computer Science, CHRIST (Deemed to be University), Bangalore, Karnataka, India.

PMID: 36260373
DOI: 10.1089/big.2020.0243

Abstract

Cross-lingual plagiarism detection (CLPD) is a challenging problem in natural language processing. Cross-lingual plagiarism occurs when a text is translated from another language and used as is, without proper acknowledgment. Most existing methods give good results for monolingual plagiarism detection, but their performance on CLPD is very limited, because it is difficult to represent text from two different languages in a common semantic space. In this article, a novel Siamese architecture-based model is proposed to detect cross-lingual plagiarism in English-Hindi language pairs. The proposed model combines a convolutional neural network (CNN) and a bidirectional long short-term memory (Bi-LSTM) network to learn the semantic similarity between cross-lingual sentences in English-Hindi language pairs. The CNN learns the local context of words, whereas the Bi-LSTM learns the global context of sentences in the forward and backward directions. The proposed model is evaluated on a benchmark data set, the Microsoft paraphrase corpus, converted into English-Hindi language pairs. It outperforms the other models, giving weighted average precision, recall, and F1-measure scores of 67%, 72%, and 67%, respectively. The experimental results show the effectiveness of the proposed model over the baseline models, because it represents cross-lingual text efficiently.
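The data flow described above (one shared encoder applied to both sentences, convolution for local context, a bidirectional LSTM for global context, then a similarity score) can be illustrated with a minimal, untrained sketch in plain Python. It is not the authors' implementation: the layer sizes, the random per-token embeddings, the bias-free LSTM cell, the exp(-L1) similarity, and all names (`Encoder`, `similarity`) are assumptions made for illustration, and the abstract does not specify any of them.

```python
import math
import random

EMB, FILTERS, KERNEL, HIDDEN = 8, 6, 3, 4  # toy sizes, assumed for illustration


def matrix(rng, rows, cols):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Encoder:
    """One Siamese branch; both sentences pass through the SAME instance."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.seed = seed
        self.conv = matrix(rng, FILTERS, KERNEL * EMB)
        # Four gate matrices (input, forget, output, candidate) per direction.
        # Biases are omitted to keep the sketch short.
        self.lstm = {d: [matrix(rng, HIDDEN, FILTERS + HIDDEN) for _ in range(4)]
                     for d in ("fwd", "bwd")}

    def embed(self, token):
        # Hypothetical embedding: a fixed pseudo-random vector per token.
        rng = random.Random(f"{self.seed}:{token}")
        return [rng.uniform(-1.0, 1.0) for _ in range(EMB)]

    def conv_features(self, vectors):
        # 1D convolution over word windows with zero padding and ReLU.
        pad = [[0.0] * EMB] * (KERNEL // 2)
        padded = pad + vectors + pad
        out = []
        for i in range(len(vectors)):
            window = [x for vec in padded[i:i + KERNEL] for x in vec]
            out.append([max(0.0, a) for a in matvec(self.conv, window)])
        return out

    def lstm_pass(self, seq, direction):
        w_i, w_f, w_o, w_g = self.lstm[direction]
        h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
        steps = seq if direction == "fwd" else reversed(seq)
        for x in steps:
            z = x + h  # concatenate current input with previous hidden state
            i = [sigmoid(a) for a in matvec(w_i, z)]
            f = [sigmoid(a) for a in matvec(w_f, z)]
            o = [sigmoid(a) for a in matvec(w_o, z)]
            g = [math.tanh(a) for a in matvec(w_g, z)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h

    def encode(self, sentence):
        vectors = [self.embed(tok) for tok in sentence.split()]
        feats = self.conv_features(vectors)
        # Final forward and backward states, concatenated.
        return self.lstm_pass(feats, "fwd") + self.lstm_pass(feats, "bwd")


def similarity(encoder, sent_a, sent_b):
    """Assumed similarity: exp(-L1 distance), which lies in (0, 1]."""
    a, b = encoder.encode(sent_a), encoder.encode(sent_b)
    return math.exp(-sum(abs(x - y) for x, y in zip(a, b)))


enc = Encoder(seed=0)
score = similarity(enc, "the cat sat on the mat", "बिल्ली चटाई पर बैठी")
```

Because the weights here are random, `score` carries no meaning; in a trained model the shared weights would be fitted so that paraphrase pairs score close to 1 and unrelated pairs close to 0.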

Keywords: Bi-LSTM; CNN; Siamese architecture; cross-lingual plagiarism detection; deep learning.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Natural Language Processing
Neural Networks, Computer
Plagiarism*
Semantics*