ChatSubs: A dataset of dialogues in Spanish, Catalan, Basque and Galician extracted from movie subtitles for developing advanced conversational models

Ksenia Kharitonova; Zoraida Callejas; David Pérez-Fernández; Asier Gutiérrez-Fandiño; David Griol

doi:10.1016/j.dib.2023.109565

Data Brief. 2023 Sep 14;50:109565. doi: 10.1016/j.dib.2023.109565. eCollection 2023 Oct.

Authors

Ksenia Kharitonova¹, Zoraida Callejas¹ ², David Pérez-Fernández³, Asier Gutiérrez-Fandiño⁴, David Griol¹

Affiliations

¹ Department of Software Engineering, University of Granada, Granada, Spain.
² Research Centre for Information and Communication Technologies (CITIC-UGR), University of Granada, Granada, Spain.
³ Universidad Autónoma de Madrid, Madrid, Spain.
⁴ LHF Labs, Bilbao, Spain.

Abstract

The ChatSubs dataset [5] contains dialogue data in Spanish and three of Spain's co-official languages (Catalan, Basque, and Galician). It was obtained from OpenSubtitles: we gathered the movie subtitles in our languages of interest and processed them to generate clearly segmented dialogues and their turns. The data processing code is publicly accessible. The result is 206,706 JSON files with more than 20 million dialogues and 96 million turns, which makes it one of the largest dialogue corpora available, as similar datasets in better-resourced languages either do not reach 500k dialogues or contain less clearly defined conversations. Thus, the ChatSubs dataset is an ideal resource for research teams interested in training dialogue models in Spanish, Catalan, Basque, and Galician.

Keywords: Chatbots; Conversation; Conversational AI; Dialogue; Natural language processing; Speech.
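
As an illustration of how the file, dialogue, and turn counts reported in the abstract could be checked on a local copy of the data, the following is a minimal Python sketch. The abstract does not describe the JSON schema, so the layout used here (a top-level "dialogues" list whose items each hold a "turns" list) and the function name count_dialogues_and_turns are assumptions made for illustration only; they would need to be adapted to the actual structure of the ChatSubs files.

```python
import json
from pathlib import Path


def count_dialogues_and_turns(root):
    """Count JSON files, dialogues, and turns under the directory `root`.

    ASSUMPTION (not taken from the paper): each file is a JSON object with
    a "dialogues" list, and each dialogue is an object with a "turns" list.
    """
    n_files = n_dialogues = n_turns = 0
    for path in sorted(Path(root).rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        dialogues = data.get("dialogues", [])  # hypothetical key
        n_files += 1
        n_dialogues += len(dialogues)
        # "turns" is also a hypothetical key; missing keys count as zero turns.
        n_turns += sum(len(d.get("turns", [])) for d in dialogues)
    return n_files, n_dialogues, n_turns
```

Calling count_dialogues_and_turns on the dataset directory would return a (files, dialogues, turns) tuple that can be compared with the figures given in the abstract, provided the assumed keys match the real files.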