Testing the Relationship between Word Length, Frequency, and Predictability Based on the German Reference Corpus

Alexander Koplenig; Marc Kupietz; Sascha Wolfer

doi:10.1111/cogs.13090

Cogn Sci. 2022 Jun;46(6):e13090. doi: 10.1111/cogs.13090.

Authors

Alexander Koplenig¹, Marc Kupietz², Sascha Wolfer¹

Affiliations

¹ Department of Lexical Studies, Leibniz-Institute for the German Language (IDS).
² Department of Digital Linguistics, Leibniz-Institute for the German Language (IDS).

PMID: 35661231

Abstract

In a recent article, Meylan and Griffiths (2021; henceforth M&G) focus on the significant methodological challenges that can arise when using large-scale linguistic corpora. To this end, M&G revisit a well-known result of Piantadosi, Tily, and Gibson (2011; henceforth PT&G), who argue that average information content is a better predictor of word length than word frequency. We applaud M&G, who conducted a very important study that should be read by any researcher interested in working with large-scale corpora. The fact that M&G mostly failed to find clear evidence in favor of PT&G's main finding motivated us to test PT&G's idea on a subset of the largest archive of German-language texts designed for linguistic research, the German Reference Corpus, consisting of ∼43 billion words. We find only very little support for the primary data point reported by PT&G.

Keywords: Compression; Corpus linguistics; Information theory; Large-scale corpora; N-gram modeling; Uniform information density.

Publication types

Letter

MeSH terms

Humans
Language*
Linguistics*
Reading