Divide and conquer! Data-mining tools and sequential multivariate analysis to search for diagnostic morphological characters within a plant polyploid complex (Veronica subsect. Pentasepalae, Plantaginaceae)

PLoS One. 2018 Jun 29;13(6):e0199818. doi: 10.1371/journal.pone.0199818. eCollection 2018.

Abstract

This study exhaustively explores leaf features in search of diagnostic characters to aid classification (assigning cases to groups, i.e. populations to taxa) in a polyploid plant-species complex. A challenging case study was selected: Veronica subsection Pentasepalae, a taxonomically intricate group. The "divide and conquer" approach was implemented; that is, a difficult primary dataset was split into more manageable subsets. Three techniques were explored: two data-mining tools (artificial neural networks and decision trees) and one unsupervised discriminant analysis. However, only the decision trees and discriminant analysis were finally used to select diagnostic traits. A previously established classification hypothesis based on other data sources was used as a starting point. A guided discriminant analysis (i.e. involving manual character selection) was used to produce a grouping scheme fitting this hypothesis so that it could be taken as a reference. Sequential unsupervised multivariate analysis enabled the recognition of all species and infraspecific taxa; however, a suboptimal classification rate was achieved. Decision trees resulted in better classification rates than unsupervised multivariate analysis, but three complete taxa were misidentified (not present in terminal nodes). The variable selection led to a different grouping scheme in the case of decision trees. The resulting groups displayed low misclassification rates when analyzed using artificial neural networks. Both the decision trees and the discriminant analysis are recommended in the search for diagnostic characters. Because artificial neural networks are highly sensitive to the combination of input/output layers, they are proposed as evaluation tools for morphometric studies. The "divide and conquer" principle is a promising strategy, and it proved successful in the present case study.
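As a rough illustration of how a decision tree can surface diagnostic characters by repeatedly splitting a dataset into more manageable subsets, the sketch below builds a small Gini-based tree in plain Python. It is not the software or the algorithm variant used in the study, which the abstract does not specify. The character names (leaf_length_mm, leaf_width_mm), the taxon labels (taxon_A, taxon_B, taxon_C) and all measurements are invented placeholders.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted impurity, character index, threshold) of the best
    single-character split, or None if no character varies."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, names):
    """Recursively divide the cases until each terminal node holds one taxon."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels)
    if split is None:  # remaining cases are identical: fall back to majority
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    le = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    gt = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "character": names[f],
        "threshold": t,
        "le": build_tree([r for r, _ in le], [y for _, y in le], names),
        "gt": build_tree([r for r, _ in gt], [y for _, y in gt], names),
    }

def classify(tree, row, names):
    """Walk the tree from the root to a terminal node for one case."""
    while isinstance(tree, dict):
        f = names.index(tree["character"])
        tree = tree["le"] if row[f] <= tree["threshold"] else tree["gt"]
    return tree

# Invented toy measurements (NOT data from the study).
names = ["leaf_length_mm", "leaf_width_mm"]
rows = [(10, 2), (12, 3), (11, 2.5), (30, 2), (32, 3), (31, 8), (29, 9)]
labels = ["taxon_A", "taxon_A", "taxon_A",
          "taxon_B", "taxon_B", "taxon_C", "taxon_C"]

tree = build_tree(rows, labels, names)
predictions = [classify(tree, r, names) for r in rows]
```

In this toy case the characters chosen at the internal nodes play the role of candidate diagnostic characters: the root separates one hypothetical taxon by leaf length, and the remaining subset is then resolved by leaf width.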

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Mining / methods*
  • Neural Networks, Computer*
  • Polyploidy*
  • Veronica / classification*
  • Veronica / genetics*

Grants and funding

This research was financially supported by the Spanish Ministry of Research, Development and Innovation through the projects [CGL2012-32574], [CGL2009-07555] and [CGL2014-52787-C3-2-P], http://www.idi.mineco.gob.es/portal/site/MICINN/?lang_choosen=en; Spanish Ministry of Research, Development and Innovation through PhD scholarships to NLG [AP2010-2968] and BMRA [AP2008-03434]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.