Machine Learning Applications and Optimization of Clustering Methods Improve the Selection of Descriptors in Blackberry Germplasm Banks

Plants (Basel). 2021 Jan 28;10(2):247. doi: 10.3390/plants10020247.

Abstract

Machine learning (ML) and its multiple applications have comparative advantages for improving the interpretation of knowledge on different agricultural processes. However, there are challenges that impede proper usage, as can be seen in phenotypic characterizations of germplasm banks. The objective of this research was to test and optimize different analysis methods based on ML for the prioritization and selection of morphological descriptors of Rubus spp. 55 descriptors were evaluated in 26 genotypes and the weight of each one and its ability to discriminating capacity was determined. ML methods as random forest (RF), support vector machines, in the linear and radial forms, and neural networks were optimized and compared. Subsequently, the results were validated with two discriminating methods and their variants: hierarchical agglomerative clustering and K-means. The results indicated that RF presented the highest accuracy (0.768) of the methods evaluated, selecting 11 descriptors based on the purity (Gini index), importance, number of connected trees, and significance (p value < 0.05). Additionally, K-means method with optimized descriptors based on RF had greater discriminating power on Rubus spp., accessions according to evaluated statistics. This study presents one application of ML for the optimization of specific morphological variables for plant germplasm bank characterization.

Keywords: K-means; data science; digital agriculture; morphological descriptors; random forest.