Reliable genomic strategies for species classification of plant genetic resources

Artur van Bemmelen van der Plaat; Rob van Treuren; Theo J L van Hintum

doi:10.1186/s12859-021-04018-6

BMC Bioinformatics. 2021 Mar 31;22(1):173. doi: 10.1186/s12859-021-04018-6.

Authors

Artur van Bemmelen van der Plaat¹, Rob van Treuren², Theo J L van Hintum²

Affiliations

¹ Centre for Genetic Resources, Wageningen University and Research, P.O. Box 16, 6700 AA, Wageningen, The Netherlands. artur.vanbemmelen@wur.nl.
² Centre for Genetic Resources, Wageningen University and Research, P.O. Box 16, 6700 AA, Wageningen, The Netherlands.

Abstract

Background: To address the need for easy and reliable species classification in plant genetic resources collections, we assessed the potential of five classifiers (Random Forest, Neighbour-Joining, 1-Nearest Neighbour, a conservative variety of 3-Nearest Neighbours and Naive Bayes). We investigated the effects of the number of accessions per species and of the misclassification rate on classification success, and validated the generic value of the results with three complete datasets.

Results: We found the conservative variety of 3-Nearest Neighbours to be the most reliable classifier when varying species representation and misclassification rate. Analysis of the three complete datasets showed that this finding has generic value. Additionally, we present various options for marker selection for classification tasks such as these.
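The abstract does not spell out what makes the 3-Nearest Neighbours variant conservative. A minimal sketch follows, assuming that a species is assigned only when all three nearest reference accessions carry the same label, and that the accession is otherwise left unclassified. The function names, the simple mismatch distance, the toy marker vectors and the labels `species_A` and `species_B` are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def marker_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of markers at which two genotype vectors differ (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must have the same number of markers")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def conservative_3nn(
    query: Sequence[int],
    references: Sequence[Sequence[int]],
    labels: Sequence[str],
) -> Optional[str]:
    """Return a species label only if the three nearest references agree.

    Assumption: "conservative" means unanimity among the three neighbours;
    any disagreement yields None (unclassified).
    """
    if len(references) != len(labels):
        raise ValueError("each reference accession needs exactly one label")
    if len(references) < 3:
        return None
    # Rank reference accessions by distance to the query, nearest first.
    ranked = sorted(
        range(len(references)),
        key=lambda i: marker_distance(query, references[i]),
    )
    nearest_labels = {labels[i] for i in ranked[:3]}
    return nearest_labels.pop() if len(nearest_labels) == 1 else None


# Toy reference set: six accessions scored at six biallelic markers.
# The third accession sits in the species_A cluster but is labelled
# species_B, mimicking a possibly misclassified genebank record.
references = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
]
labels = [
    "species_A", "species_A", "species_B",
    "species_B", "species_B", "species_B",
]

ambiguous = conservative_3nn([0, 0, 0, 0, 0, 0], references, labels)
unanimous = conservative_3nn([1, 1, 1, 1, 1, 1], references, labels)
```

Under this reading, `ambiguous` is left unclassified because one of its three neighbours disagrees, whereas a plain majority vote would have assigned `species_A`. Withholding a label in such cases is one way a classifier could remain reliable when some reference labels are themselves wrong.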

Conclusions: Large-scale genomic data are increasingly being produced for genetic resources collections. These data are useful for addressing species classification issues regarding crop wild relatives and for improving gene bank documentation. Implementing a classification method that can improve the quality of poorly classified datasets without gold standard training data is considered an innovative and efficient way to achieve this.

Keywords: Crop wild relatives; Gene bank documentation; Genomics; Machine learning; Plant genetic resources; Species classification.

MeSH terms

Bayes Theorem
Cluster Analysis
Genomics*
Plants* / genetics
