SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan

Behav Res Methods. 2020 Feb;52(1):360-375. doi: 10.3758/s13428-019-01233-1.

Abstract

SUBTLEX-CAT is a word frequency and contextual diversity database for Catalan, derived from a 278-million-word corpus of subtitles from Catalan broadcast television. Like all previous SUBTLEX corpora, it comprises subtitles from films and TV series. In addition, it includes a wider range of TV shows (e.g., news, documentaries, debates, and talk shows) than most previous databases. Frequency metrics were computed both for the whole corpus and for the subset of films and fiction TV series. Two lexical decision experiments revealed that the subtitle-based metrics outperformed the previously available frequency estimates, which were computed from either written texts or Internet texts. Furthermore, the metrics obtained from the whole corpus were better predictors than those obtained from films and fiction TV series alone. In both experiments, the best predictor of response times and accuracy was contextual diversity.
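
The two kinds of metric named above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not give the exact definitions, so the sketch assumes the conventions common in subtitle-based norms, namely frequency per million tokens and contextual diversity as the percentage of subtitle files in which a word occurs. The tokenizer, the function names, and the three toy "subtitle files" are hypothetical and exist only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of Unicode letters only.
    # Real Catalan text would need extra care (e.g. apostrophes, "l·l").
    return re.findall(r"[^\W\d_]+", text.lower())


def subtitle_metrics(documents):
    """Return per-word count, frequency per million, and contextual diversity.

    Assumption: each element of `documents` is the text of one subtitle file,
    and contextual diversity is the percentage of files containing the word.
    """
    word_counts = Counter()  # total occurrences across the corpus
    doc_counts = Counter()   # number of files in which each word appears
    for text in documents:
        tokens = tokenize(text)
        word_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each word once per file

    total_tokens = sum(word_counts.values())
    n_docs = len(documents)
    return {
        word: {
            "count": count,
            "per_million": count * 1_000_000 / total_tokens,
            "cd_percent": 100 * doc_counts[word] / n_docs,
        }
        for word, count in word_counts.items()
    }


# Toy data standing in for subtitle files (not taken from the corpus).
example_docs = ["El gat dorm.", "El gos corre i el gat mira.", "Bon dia!"]
metrics = subtitle_metrics(example_docs)
```

In this toy corpus, "el" occurs three times but in only two of the three files, so its raw count and its contextual diversity rank it differently against other words. That divergence is what allows the two measures to be compared as predictors.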

Keywords: Catalan language; Contextual diversity; Subtitles; Word frequency.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Databases, Factual
  • Humans
  • Motion Pictures
  • Spain
  • Speech*
  • Television
  • Time Factors
  • Writing*