Protein Construction-Based Data Partitioning Scheme for Alignment of Protein Macromolecular Structures Through Distributed Querying in Federated Databases

Dariusz Mrozek; Jacek Kwiendacz; Bozena Malysiak-Mrozek

doi:10.1109/TNB.2019.2930494

IEEE Trans Nanobioscience. 2020 Jan;19(1):102-116. doi: 10.1109/TNB.2019.2930494. Epub 2019 Jul 22.

PMID: 31329125

Abstract

Exploring the characteristics of 3D protein structures by querying relational databases that store them can be challenging, because the data must conform to a particular database schema. However, this approach also brings several advantages, such as the ability to perform extensive database searches with the declarative SQL language, to protect data against hardware failures through regular backup mechanisms, and to secure data against unauthorized access. Since relational databases do not provide exploration methods specific to protein data and its biological semantics, such as searches based on protein structural patterns, their use in this domain is still rare and requires the development of dedicated methods that speed up data exploration. In this paper, we present a novel data partitioning scheme for distributing data across database clusters that can be used for performing sophisticated explorations of 3D protein structures. The scheme relies on protein construction; it requires data preprocessing but results in shorter exploration times when querying federated databases. We solve the problem of finding proteins in an Oracle relational database on the basis of 3D protein structure similarity by using distributed PAR-P3D-SQL queries. Since 3D protein structure similarity searching is one of the most time-consuming exploration processes that can be performed on protein data, we use a distributed environment of Oracle federated databases, distributed query processing, and dedicated load balancing methods to accelerate the exploration. The results of the performed tests confirm that we are able to significantly increase the speed of the exploration process, proportionally to the number of database nodes in the federated environment.
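
The abstract does not describe the construction-based partitioning scheme in detail, so the following is only a generic illustration of load-balanced partitioning across database nodes, not the authors' method. It assumes, purely for illustration, that the cost of processing a protein is approximated by its residue count, and it uses a standard greedy heuristic (largest item to the least-loaded node). The function names, protein identifiers, and sizes are all hypothetical.

```python
import heapq


def partition_by_size(proteins, n_nodes):
    """Assign each protein to a node, largest first, always picking the
    currently least-loaded node (greedy longest-processing-time heuristic).

    proteins: dict mapping a protein id to an assumed cost (residue count).
    Returns a dict mapping node index to the list of protein ids assigned.
    """
    # Min-heap of (current load, node index); all nodes start empty.
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}

    # Sort by descending cost; ties are broken by id for determinism.
    for pid, size in sorted(proteins.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        assignment[node].append(pid)
        heapq.heappush(heap, (load + size, node))
    return assignment


def node_loads(proteins, assignment):
    """Total assumed cost placed on each node."""
    return {node: sum(proteins[p] for p in ids)
            for node, ids in assignment.items()}


if __name__ == "__main__":
    # Hypothetical protein ids and residue counts.
    proteins = {"P1": 300, "P2": 120, "P3": 450, "P4": 80, "P5": 210}
    assignment = partition_by_size(proteins, n_nodes=2)
    print(assignment)
    print(node_loads(proteins, assignment))
```

With balanced partitions, each node in a federated setup would scan a similar amount of data for a given similarity query, which is the general precondition for speedups that scale with the number of nodes.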

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology / methods*
Databases, Protein*
Models, Molecular*
Protein Conformation
Proteins* / chemistry
Proteins* / ultrastructure
Sequence Alignment / methods*
Sequence Analysis, Protein / methods

Substances

Proteins