On the fractal patterns of language structures

Leonardo Costa Ribeiro; Américo Tristão Bernardes; Heliana Mello

doi:10.1371/journal.pone.0285630

PLoS One. 2023 May 18;18(5):e0285630. doi: 10.1371/journal.pone.0285630. eCollection 2023.

Authors

Leonardo Costa Ribeiro¹, Américo Tristão Bernardes², Heliana Mello³

Affiliations

¹ Departamento de Ciências Econômicas, Faculdade de Ciências Econômicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brasil.
² Departamento de Física, Instituto de Ciências Exatas e Biológicas, Universidade Federal de Ouro Preto, Ouro Preto, Minas Gerais, Brasil.
³ Faculdade de Letras, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brasil.

Abstract

Natural Language Processing (NLP) uses Artificial Intelligence algorithms to extract meaningful information from unstructured texts, i.e., content that lacks metadata and cannot easily be indexed or mapped onto standard database fields. Its applications range from sentiment analysis and text summarization to automatic language translation. In this work, we use NLP to identify similar structural linguistic patterns across several different languages. We apply the word2vec algorithm, which creates a vector representation of words in a multidimensional space that preserves the meaning relationships between them. From a large corpus, we built this vector representation in a 100-dimensional space for English, Portuguese, German, Spanish, Russian, French, Chinese, Japanese, Korean, Italian, Arabic, Hebrew, Basque, Dutch, Swedish, Finnish, and Estonian. We then calculated the fractal dimensions of the structure that represents each language. The structures are multifractals with two different dimensions, which we use, together with the token-to-dictionary size ratio of each language, to represent the languages in a three-dimensional space. Finally, analyzing the distances among languages in this space, we conclude that closeness there tends to be related to distance in the phylogenetic tree that depicts the lines of evolutionary descent of the languages from a common ancestor.
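
The abstract does not say which estimator was used for the fractal dimensions. As an illustration only, the following minimal sketch estimates a correlation dimension of a point cloud from the log-log slope of the correlation sum, which is one common way to assign a fractal dimension to a set of vectors. The function names, the radii, and the toy data are assumptions made for this example and are not taken from the paper; real word vectors would replace the toy points.

```python
import math


def correlation_sum(points, r):
    """Fraction of distinct point pairs whose Euclidean distance is below r."""
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < r:
                close += 1
    return close / (n * (n - 1) / 2)


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def correlation_dimension(points, radii):
    """Slope of log C(r) versus log r over the given radii.

    Radii with an empty correlation sum are skipped, since log(0)
    is undefined.
    """
    log_r, log_c = [], []
    for r in radii:
        c = correlation_sum(points, r)
        if c > 0:
            log_r.append(math.log(r))
            log_c.append(math.log(c))
    return slope(log_r, log_c)


# Toy data (hypothetical): 400 evenly spaced points on a straight line
# embedded in 3-D space. A line is one-dimensional, so the estimate
# should come out close to 1 regardless of the embedding dimension.
line = [(i / 399, 2 * i / 399, 0.0) for i in range(400)]
radii = [0.05, 0.1, 0.2, 0.4]
print(correlation_dimension(line, radii))
```

For a structure with two scaling regimes, as the abstract describes, one possible approach is to call `correlation_dimension` separately on a small-radius range and a large-radius range, which gives one slope per regime. The pairwise loop is quadratic in the number of points, so a full vocabulary would need sampling or a spatial index.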

Copyright: © 2023 Ribeiro et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Artificial Intelligence*
Fractals*
Language
Natural Language Processing
Phylogeny
Translating

Grants and funding

This work was partly supported by the Brazilian agencies CNPq (307633/2019-5 and 312020/2021-0) and PRPq-UFMG. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.