SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan

Roger Boada; Marc Guasch; Juan Haro; Josep Demestre; Pilar Ferré

doi:10.3758/s13428-019-01233-1

Behav Res Methods. 2020 Feb;52(1):360-375. doi: 10.3758/s13428-019-01233-1.

Authors

Roger Boada¹, Marc Guasch², Juan Haro², Josep Demestre², Pilar Ferré²

Affiliations

¹ Department of Psychology and Research Center for Behavior Assessment, Universitat Rovira i Virgili, Tarragona, Spain. roger.boada@urv.cat.
² Department of Psychology and Research Center for Behavior Assessment, Universitat Rovira i Virgili, Tarragona, Spain.

PMID: 30895456

Abstract

SUBTLEX-CAT is a word frequency and contextual diversity database for Catalan, obtained from a 278-million-word corpus based on subtitles supplied from broadcast Catalan television. Like all previous SUBTLEX corpora, it comprises subtitles from films and TV series. In addition, it includes a wider range of TV shows (e.g., news, documentaries, debates, and talk shows) than has been included in most previous databases. Frequency metrics were obtained for the whole corpus, on the one hand, and only for films and fiction TV series, on the other. Two lexical decision experiments revealed that the subtitle-based metrics outperformed the previously available frequency estimates, computed from either written texts or texts from the Internet. Furthermore, the metrics obtained from the whole corpus were better predictors than the ones obtained from films and fiction TV series alone. In both experiments, the best predictor of response times and accuracy was contextual diversity.
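The two kinds of metric named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: word frequency as occurrences per million tokens, and contextual diversity as the percentage of documents (here, subtitle files) in which a word appears at least once. The function name `subtitle_metrics`, the input format (one list of tokens per subtitle file), and the toy data are hypothetical; the abstract does not describe the authors' tokenization or processing pipeline, so this is not their procedure.

```python
from collections import Counter


def subtitle_metrics(documents):
    """Compute per-word frequency and contextual diversity.

    documents: list of token lists, one per subtitle file
    (a hypothetical input format chosen for illustration).
    """
    total_tokens = sum(len(doc) for doc in documents)
    n_docs = len(documents)

    freq = Counter()       # total occurrences of each word
    doc_count = Counter()  # number of documents containing each word
    for doc in documents:
        freq.update(doc)
        doc_count.update(set(doc))  # count each word once per document

    return {
        word: {
            "count": count,
            "per_million": count * 1_000_000 / total_tokens,
            "cd_percent": 100 * doc_count[word] / n_docs,
        }
        for word, count in freq.items()
    }


# Toy corpus: three tiny "subtitle files" (invented data, not from SUBTLEX-CAT).
toy = [
    ["el", "gat", "dorm"],
    ["el", "gos", "corre"],
    ["la", "casa", "gran", "el"],
]
metrics = subtitle_metrics(toy)
print(metrics["el"])
print(metrics["gat"])
```

In this toy corpus "el" occurs in every file, so its contextual diversity is 100%, whereas "gat" occurs in one file of three. The sketch assumes a non-empty corpus; an empty input would raise a division error.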

Keywords: Catalan language; Contextual diversity; Subtitles; Word frequency.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Databases, Factual
Humans
Motion Pictures
Spain
Speech*
Television
Time Factors
Writing*