Data structure set-trie for storing and querying sets: Theoretical and empirical analysis

Iztok Savnik; Mikita Akulich; Matjaž Krnc; Riste Škrekovski

doi:10.1371/journal.pone.0245122

Data structure set-trie for storing and querying sets: Theoretical and empirical analysis

PLoS One. 2021 Feb 10;16(2):e0245122. doi: 10.1371/journal.pone.0245122. eCollection 2021.

Authors

Iztok Savnik^{1

2}, Mikita Akulich¹, Matjaž Krnc^{1

2

3}, Riste Škrekovski^{1

2

3}

Affiliations

¹ Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Koper, Slovenia.
² Faculty of Information Studies, Novo Mesto, Slovenia.
³ Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia.

Abstract

Set containment operations form an important tool in various fields such as information retrieval, AI systems, object-relational databases, and Internet applications. In the paper, a set-trie data structure for storing sets is considered, along with the efficient algorithms for the corresponding set containment operations. We present the mathematical and empirical study of the set-trie. In the mathematical study, the relevant upper-bounds on the efficiency of its expected performance are established by utilizing a natural probabilistic model. In the empirical study, we give insight into how different distributions of input data impact the efficiency of set-trie. Using the correct parameters for those randomly generated datasets, we expose the key sources of the input sensitivity of set-trie. Finally, the empirical comparison of set-trie with the inverted index is based on the real-world datasets containing sets of low cardinality. The comparison shows that the running time of set-trie consistently outperforms the inverted index by orders of magnitude.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Artificial Intelligence
Databases, Factual
Datasets as Topic
Information Storage and Retrieval / methods*
Models, Theoretical

Grants and funding

The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. P1-00383). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.