A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu

Muhammad Haseeb; Muhammad Faraz Manzoor; Muhammad Shoaib Farooq; Uzma Farooq; Adnan Abid

doi:10.1016/j.dib.2023.109857

Data Brief. 2023 Nov 26:52:109857. doi: 10.1016/j.dib.2023.109857. eCollection 2024 Feb.

Authors

Muhammad Haseeb¹, Muhammad Faraz Manzoor¹, Muhammad Shoaib Farooq¹, Uzma Farooq¹, Adnan Abid²

Affiliations

¹ Department of Computer Science, University of Management and Technology, Lahore, Pakistan.
² Department of Data Science, Faculty of Computing and Information Technology, University of the Punjab, Pakistan.

Abstract

Plagiarism detection (PD) is the process of identifying instances where someone has presented another person's work or ideas as their own. It is categorized into two types. (i) Intrinsic plagiarism detection primarily concerns the assessment of authorship consistency within a single document, aiming to identify instances where portions of the text may have been copied or paraphrased from elsewhere within the same document. Author clustering, closely related to intrinsic plagiarism detection, involves grouping documents based on their stylistic and linguistic characteristics to identify common authors or sources within a given dataset. (ii) Extrinsic plagiarism detection, on the other hand, compares a suspicious document against a set of external source documents, seeking shared phrases, sentences, or paragraphs between them, which is often referred to as text reuse or verbatim copying. Detecting plagiarism in documents is a long-established task in NLP, with notable contributions across multiple applications. Considerable research has already been conducted for English and other languages, but Urdu still needs attention, especially in the intrinsic plagiarism detection domain. The major reason is that Urdu is a low-resource language, and no high-quality benchmark corpus is available for intrinsic plagiarism detection in Urdu. This study presents a high-quality benchmark corpus comprising 10,872 documents. The corpus is structured into two granularity levels: sentence level and paragraph level. The dataset serves several purposes, facilitating intrinsic plagiarism detection, verbatim text reuse identification, and author clustering in the Urdu language. It is also significant for natural language processing researchers and practitioners, as it facilitates the development of specialized plagiarism detection models tailored to Urdu. These models can play a vital role in education and publishing by improving the accuracy of plagiarism detection, addressing an existing gap and enhancing the overall ability to identify copied content in Urdu writing.
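To make the idea of checking authorship consistency within a single document concrete, the following minimal Python sketch flags segments (sentences or paragraphs) whose style deviates from the rest of the document. It is an illustration only and not the method used to build or evaluate the corpus: the single stylometric feature (mean word length), the z-score rule, the threshold of 2.0, and the function names `mean_word_length` and `flag_style_outliers` are all assumptions made for this example. Whitespace splitting is used as a rough tokenizer, which may need to be replaced by a proper Urdu tokenizer in practice.

```python
import statistics


def mean_word_length(segment: str) -> float:
    """Hypothetical stylometric feature: average length of whitespace-separated tokens."""
    words = segment.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def flag_style_outliers(segments, threshold=2.0):
    """Return indices of segments whose feature value lies more than
    `threshold` population standard deviations from the document mean.

    Assumption: one feature and a fixed threshold; a realistic system would
    combine many stylometric features and tune the decision rule.
    """
    values = [mean_word_length(s) for s in segments]
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        # All segments look identical under this feature: nothing to flag.
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / spread > threshold]


if __name__ == "__main__":
    # Placeholder segments; in practice these would be Urdu sentences or paragraphs.
    document = ["aa bb cc"] * 5 + ["aaaaaaaaaa bbbbbbbbbb"]
    print(flag_style_outliers(document))
```

With very few segments the z-score is bounded (a single outlier among n segments cannot exceed the square root of n - 1), so the threshold in this sketch only makes sense for documents with enough segments to estimate a stable style profile.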

Keywords: Intrinsic plagiarism; Paragraph; Plagiarism detection; Sentence; Stylometry features; Urdu language.