Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences

Marika Kaden; Katrin Sophie Bohnsack; Mirko Weber; Mateusz Kudła; Kaja Gutowska; Jacek Blazewicz; Thomas Villmann

doi:10.1007/s00521-021-06018-2

Neural Comput Appl. 2022;34(1):67-78. doi: 10.1007/s00521-021-06018-2. Epub 2021 Apr 27.

Authors

Marika Kaden^# ¹ ², Katrin Sophie Bohnsack^# ¹ ², Mirko Weber^# ¹ ², Mateusz Kudła^# ¹ ³, Kaja Gutowska^# ³ ⁴ ⁵, Jacek Blazewicz^# ³ ⁴ ⁵, Thomas Villmann^# ¹ ²

Affiliations

¹ University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.
² Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany.
³ Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland.
⁴ Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland.
⁵ European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland.

^# Contributed equally.

Abstract

We present an approach that discriminates SARS-CoV-2 virus types from their RNA sequences without requiring sequence alignment. To this end, sequences are preprocessed by feature extraction, and the resulting feature vectors are analyzed by prototype-based classification so that the model remains interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The corresponding matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can be equipped with easy-to-implement reject options for uncertain data. These options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. The model is first trained on a GISAID dataset whose virus types were determined by phylogenetic tree clustering according to molecular differences in coronavirus populations. In a second step, we apply the trained model to a different, unlabeled SARS-CoV-2 dataset. For these data, we can either assign a virus type to each sequence or reject atypical samples. The rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Finally, the presented approach has lower computational complexity than methods based on (multiple) sequence alignment.

Supplementary information: The online version contains supplementary material available at 10.1007/s00521-021-06018-2.

Keywords: Interpretable models; Genomic sequence analysis; Learning vector quantization; Reject options.
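
Illustrative sketch (not taken from the paper): the abstract describes alignment-free feature extraction followed by prototype-based classification with a reject option. The Python sketch below shows one way such a pipeline could look. Everything specific in it is an assumption made for illustration: the use of normalized k-mer counts as features, the squared Euclidean distance in place of a learned matrix dissimilarity, the relative-distance certainty score, the threshold value, and the function names `kmer_features` and `classify_with_reject`. No training is performed; prototypes are supplied by hand.

```python
from itertools import product


def kmer_features(seq, k=3):
    """Alignment-free features: normalized k-mer counts over A, C, G, U.

    Assumption for illustration; the paper's actual feature extraction
    is not specified in the abstract.
    """
    alphabet = "ACGU"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    seq = seq.upper().replace("T", "U")  # accept DNA-style letters too
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows containing ambiguous bases such as N
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def sq_euclidean(x, w):
    """Squared Euclidean distance (stand-in for a learned matrix distance)."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def classify_with_reject(x, prototypes, threshold=0.1):
    """Nearest-prototype decision with a simple reject option.

    prototypes: list of (label, vector) pairs covering at least two classes.
    Returns the winning label, or None when the decision is too uncertain.
    """
    # Distance to the closest prototype of each class.
    best = {}
    for label, w in prototypes:
        d = sq_euclidean(x, w)
        if label not in best or d < best[label]:
            best[label] = d
    ranked = sorted(best.items(), key=lambda item: item[1])
    (label_plus, d_plus), (_, d_minus) = ranked[0], ranked[1]
    # Relative certainty in [0, 1]: 0 means the two best classes are tied.
    denom = d_plus + d_minus
    certainty = (d_minus - d_plus) / denom if denom > 0 else 0.0
    return label_plus if certainty >= threshold else None


if __name__ == "__main__":
    # Hypothetical toy prototypes in a 2-dimensional feature space.
    toy_prototypes = [("type_1", [0.0, 0.0]), ("type_2", [1.0, 1.0])]
    print(classify_with_reject([0.1, 0.0], toy_prototypes))  # clear case
    print(classify_with_reject([0.5, 0.5], toy_prototypes))  # tied case
```

In this sketch, a sample that lies about equally far from the best prototypes of two classes receives a certainty near zero and is rejected, which mirrors the idea of refusing a decision when the model evidence is insufficient. A matrix LVQ would replace `sq_euclidean` with a distance of the form (x - w)ᵀ Ωᵀ Ω (x - w) and learn both prototypes and Ω from labeled data.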