Level statistics of words: finding keywords in literary texts and symbolic sequences

P Carpena; P Bernaola-Galván; M Hackenberg; A V Coronado; J L Oliver

doi:10.1103/PhysRevE.79.035102

Phys Rev E Stat Nonlin Soft Matter Phys. 2009 Mar;79(3 Pt 2):035102. doi: 10.1103/PhysRevE.79.035102. Epub 2009 Mar 10.

Authors

P Carpena¹, P Bernaola-Galván, M Hackenberg, A V Coronado, J L Oliver

Affiliation

¹ Departamento de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain.

PMID: 19392005
DOI: 10.1103/PhysRevE.79.035102

Abstract

Using a generalization of the level statistics analysis of quantum disordered systems, we present an approach that automatically extracts keywords from literary texts. Our approach takes into account not only the frequencies of the words present in the text but also their spatial distribution along the text, and is based on the fact that relevant words are significantly clustered (i.e., they attract one another), while irrelevant words are distributed randomly in the text. Since a reference corpus is not needed, our approach is especially suitable for single documents for which no a priori information is available. In addition, we show that our method also works in generic symbolic sequences (continuous texts without spaces), thus suggesting its general applicability.
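
The clustering idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give their exact statistic. As an assumption, the sketch scores each word by the standard deviation of the spacings between its consecutive occurrences divided by their mean, then divides by sqrt(1 - p), the value expected for geometrically distributed spacings when a word is placed at random with probability p per position. The function name `clustering_scores`, the tokenizer, and the `min_count` cutoff are hypothetical choices made for illustration.

```python
import math
import re
from collections import defaultdict


def clustering_scores(text, min_count=5):
    """Score words by how clustered their occurrences are (illustrative sketch).

    Assumed statistic: sigma = std(spacings) / mean(spacings), divided by
    sqrt(1 - p) with p = count / total_words, the value a randomly placed
    word (geometric spacings) would give. Scores well above 1 suggest
    clustering; scores near or below 1 suggest random or regular placement.
    """
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)

    # Record the position of every occurrence of every word.
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)

    scores = {}
    for word, pos in positions.items():
        count = len(pos)
        p = count / total
        # Skip rare words (too few spacings) and the degenerate p == 1 case.
        if count < min_count or p >= 1.0:
            continue
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        mean_gap = sum(gaps) / len(gaps)
        variance = sum((g - mean_gap) ** 2 for g in gaps) / len(gaps)
        sigma = math.sqrt(variance) / mean_gap
        scores[word] = sigma / math.sqrt(1.0 - p)
    return scores


def top_keywords(text, k=10, min_count=5):
    """Return the k words with the highest clustering score."""
    scores = clustering_scores(text, min_count=min_count)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Calling `top_keywords(open("book.txt").read())` on a single document (the file name is a placeholder) would rank candidate keywords without any reference corpus. Because the score depends only on positions, the same function could in principle be applied to substrings of a sequence without spaces, provided a suitable tokenizer is supplied.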

Publication types

Research Support, Non-U.S. Gov't