Estimating the prevalence and diversity of words in written language

Brendan T Johns; Melody Dye; Michael N Jones

doi:10.1177/1747021819897560

Q J Exp Psychol (Hove). 2020 Jun;73(6):841-855. doi: 10.1177/1747021819897560. Epub 2020 Feb 14.

Authors

Brendan T Johns¹, Melody Dye², Michael N Jones³

Affiliations

¹ Department of Communicative Disorders and Sciences, University at Buffalo, Buffalo, NY, USA.
² University of California, Berkeley, CA, USA.
³ Indiana University Bloomington, Bloomington, IN, USA.

PMID: 31826715

Abstract

Word prevalence, a recently introduced crowd-sourced language metric, estimates the proportion of the population that knows a given word. This measure has been shown to account for unique variance in large sets of lexical performance data. This article aims to build on the work of Brysbaert et al. and Keuleers et al. by introducing new corpus-based metrics that estimate how likely a word is to be an active member of the natural language environment, and hence to be known by a larger subset of the general population. These metrics are derived from an analysis of a newly collected corpus of over 25,000 fiction and non-fiction books, and they are shown to account for significantly more variance than past corpus-based measures.

Keywords: Lexical organisation; big data; corpus studies; semantic diversity.

MeSH terms

Big Data
Humans
Psycholinguistics*
Semantics
Vocabulary*