Estimating the prevalence and diversity of words in written language

Q J Exp Psychol (Hove). 2020 Jun;73(6):841-855. doi: 10.1177/1747021819897560. Epub 2020 Feb 14.

Abstract

Recently, a new crowd-sourced language metric called word prevalence has been introduced, which estimates the proportion of the population that knows a given word. This measure has been shown to account for unique variance in large sets of lexical performance data. This article aims to build on the work of Brysbaert et al. and Keuleers et al. by introducing new corpus-based metrics that estimate how likely a word is to be an active member of the natural language environment, and hence known by a larger subset of the general population. These metrics are derived from an analysis of a newly collected corpus of over 25,000 fiction and non-fiction books and are shown to account for significantly more variance than past corpus-based measures.

Keywords: Lexical organisation; big data; corpus studies; semantic diversity.

MeSH terms

  • Big Data
  • Humans
  • Psycholinguistics*
  • Semantics
  • Vocabulary*