Structural Outlier Detection and Zernike-Canterakis Moments for Molecular Surface Meshes-Fast Implementation in Python

Molecules. 2023 Dec 21;29(1):52. doi: 10.3390/molecules29010052.

Abstract

Object retrieval systems measure the degree of similarity of the shape of 3D models. They search for the elements of the 3D model databases that resemble the query model. In structural bioinformatics, the query model is a protein tertiary/quaternary structure and the objective is to find similarly shaped molecules in the Protein Data Bank. With the ever-growing size of the PDB, a direct atomic coordinate comparison with all its members is impractical. To overcome this problem, the shape of the molecules can be encoded by fixed-length feature vectors. The distance of a protein to the entire PDB can be measured in this low-dimensional domain in linear time. The state-of-the-art approaches utilize Zernike-Canterakis moments for the shape encoding and supply the retrieval process with geometric data of the input structures. The BioZernike descriptors are a standard utility of the PDB since 2020. However, when trying to calculate the ZC moments locally, the issue of the deficiency of libraries readily available for use in custom programs (i.e., without relying on external binaries) is encountered, in particular programs written in Python. Here, a fast and well-documented Python implementation of the Pozo-Koehl algorithm is presented. In contrast to the more popular algorithm by Novotni and Klein, which is based on the voxelized volume, the PK algorithm produces ZC moments directly from the triangular surface meshes of 3D models. In particular, it can accept the molecular surfaces of proteins as its input. In the presented PK-Zernike library, owing to Numba's just-in-time compilation, a mesh with 50,000 facets is processed by a single thread in a second at the moment order 20. Since this is the first time the PK algorithm is used in structural bioinformatics, it is employed in a novel, simple, but efficient protein structure retrieval pipeline. The elimination of the outlying chain fragments via a fast PCA-based subroutine improves the discrimination ability, allowing for this pipeline to achieve an 0.961 area under the ROC curve in the BioZernike validation suite (0.997 for the assemblies). The correlation between the results of the proposed approach and of the 3D Surfer program attains values up to 0.99.

Keywords: Numba; Python; Zernike moments; bioinformatics; computational geometry; molecular surface; principal component analysis; protein structure; shape retrieval.