Dataset for studying gender disparity in English literary texts

Data Brief. 2022 Feb 2;41:107905. doi: 10.1016/j.dib.2022.107905. eCollection 2022 Apr.

Abstract

Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational tools to analyze corpora such as the copyright-expired literature of the pre-modern period (defined herein as books published approximately between 1800 and 1950) in the Project Gutenberg corpus. Nevertheless, using such tools poses challenges, especially in maintaining sufficiently high quality to explore interesting hypotheses. We present a dataset and materials that illustrate how modern NLP methods can be applied to the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns from the text with high quality, (ii) disambiguate characters so that they are not overcounted, and (iii) detect the gender of each character. We also manually labeled the genders of the authors of these texts and published the labels as part of the dataset to facilitate future digital humanities research.
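The three processing steps named in the abstract can be illustrated with a deliberately simple sketch. This is not the authors' pipeline, which the abstract does not specify. It is a toy heuristic that assumes characters are introduced with an English honorific, that mentions sharing a surname and an honorific gender refer to one character, and that the honorific indicates gender. The names `TITLE_GENDER`, `PRONOUN_GENDER`, `extract_mentions`, and `disambiguate` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical lookup tables; a real system would need far richer resources.
TITLE_GENDER = {"Mr": "male", "Sir": "male", "Lord": "male",
                "Mrs": "female", "Miss": "female", "Lady": "female"}
PRONOUN_GENDER = {"he": "male", "him": "male", "his": "male",
                  "she": "female", "her": "female", "hers": "female"}

# An honorific followed by one or two capitalized words, e.g. "Mr. Fitzwilliam Darcy".
TITLED_NAME = re.compile(
    r"\b(Mrs|Mr|Miss|Sir|Lady|Lord)\.?\s+((?:[A-Z][a-z]+\s+)?[A-Z][a-z]+)"
)


def extract_mentions(text):
    """Step (i): collect (title, name) mentions and count gendered pronouns."""
    names = [(m.group(1), m.group(2)) for m in TITLED_NAME.finditer(text)]
    words = re.findall(r"[a-z]+", text.lower())
    pronouns = Counter(PRONOUN_GENDER[w] for w in words if w in PRONOUN_GENDER)
    return names, pronouns


def disambiguate(names):
    """Steps (ii) and (iii): merge mentions that share a surname and title gender.

    Assumption: "Mr. Darcy" and "Mr. Fitzwilliam Darcy" are the same character.
    Each key is (surname, gender); each value is the number of mentions.
    """
    characters = Counter()
    for title, name in names:
        surname = name.split()[-1]
        characters[(surname, TITLE_GENDER[title])] += 1
    return characters


sample = ("Mr. Darcy arrived late. Mr. Fitzwilliam Darcy bowed to Miss Bennet, "
          "and she smiled at him. Miss Bennet left.")
names, pronouns = extract_mentions(sample)
characters = disambiguate(names)
print(dict(characters))
print(dict(pronouns))
```

On the sample sentence, the four titled mentions collapse into two characters, which is the overcounting problem that step (ii) addresses. The heuristic fails in many ordinary cases, for example two same-gender relatives who share a surname, or characters who are never given an honorific, which is why quality control is a central concern for a dataset of this kind.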

Keywords: Digital humanities; Gender disparity; Natural language processing; Text analytics.